Skip or Replay? Predicting Song Preferences on Spotify Using Machine Learning (Group L)

Authors

Junod, Alexander
Leroy, Camille
Menten, Arthur

Affiliation

Université de Lausanne - Faculty of Business and Economics (HEC)

Published

May 19, 2024

Abstract

In our project, “Skip or Replay? Predicting Song Preferences on Spotify Using Machine Learning,” we aimed to develop models that accurately predict personal music preferences based on audio characteristics and metadata. Using the Spotify API, we compiled a dataset of 400 tracks, each labeled as liked or disliked to reflect individual musical preferences. This dataset included auditory attributes like Tempo, Energy, and Danceability, alongside metadata such as Artist Popularity and Genre for each track. In our exploratory data analysis, we focused on understanding the distribution and relationships of 22 variables in relation to the ‘Liked’ and ‘Disliked’ status of tracks. Key findings highlighted preferences for songs with higher Valence, lower Speechiness, and non-Explicit lyrics. Principal Component Analysis revealed the complexity of our dataset, with significant variables showing clear influences on song preferences, setting the stage for targeted predictive modeling. We divided our dataset into three subsets: training, validation, and test. We used these subsets to assess the performance of several baseline models. Specifically, we evaluated Logistic Regression, Classification Trees, SVM, and Random Forests, along with a simple Naïve baseline model for comparison. To ensure the robustness of our evaluations, we employed 5-fold cross-validation. All these models were implemented using the caret package in R, which provides tools for model training, parameter tuning, and performance assessment. Logistic Regression and Random Forest emerged as the most effective, demonstrating superior predictive capabilities on the Balanced Accuracy, Kappa, Precision, Recall, and AUC metrics. These models were chosen for hyperparameter tuning due to their ability to accurately differentiate between Liked and Disliked songs.
The tuning significantly improved the models’ performance across all metrics, demonstrating their robustness on both the validation and test datasets. Ultimately, both models exhibited a strong ability to generalize to unseen data, with Logistic Regression showing slight improvements in precision and Random Forest displaying notable gains in recall and overall predictive strength. The model interpretation analysis revealed that both Logistic Regression and Random Forest identified ‘Valence’, ‘Explicit’, ‘WordCount’, and ‘GenreCount’ as key predictors of a song’s likability, demonstrating their consistent influence across different modeling approaches and validating our EDA findings. While Logistic Regression focused on these specific features, highlighting their direct impact, the Random Forest model was able to detect subtler cues such as ‘Acousticness’ and ‘AvgSegDuration’, showcasing its adeptness at handling complex interactions between features. Building on this understanding, the analysis using Partial Dependence Plots (PDPs) further elucidated differences between the two models in their approach to predicting song likability. Logistic Regression displayed a broad, more linear sensitivity to changes in features like Valence, Explicitness, and Instrumentalness, often showing clear, almost linear trends in how the predicted probability changes with feature values (i.e., as the variable increased, the probability increased linearly). In contrast, Random Forest exhibited more consistent probabilities across these features, demonstrating robustness to individual feature changes and a superior capability to manage non-linear relationships and feature interactions.

1 Honor Pledge

This project was written by us and in our own words, except for quotations from published and unpublished sources, which are clearly indicated and acknowledged as such. We are conscious that the incorporation of material from other works or a paraphrase of such material without acknowledgement will be treated as plagiarism, subject to the custom and usage of the subject, according to the University Regulations. The source of any picture, map or other illustration is also indicated, as is the source, published or unpublished, of any material not resulting from our own research.

Camille and Alex’s Signature Arthur’s Signature

2 Introduction

2.1 Context and Background

In the digital age, music streaming platforms such as Spotify and Apple Music have revolutionized how people access and enjoy music, offering personalized listening experiences and access to a vast array of new songs and genres. Spotify, which commands 31.4% of a global digital music streaming market of roughly 1.68 billion users (Spotify User Stats, Updated March 2024), provides developers and analysts with extensive access to its data through its web API. This wealth of data made Spotify an ideal resource for our machine learning project “Skip or Replay? Predicting Song Preferences on Spotify Using Machine Learning”, allowing us to explore various analytical methods in line with the objectives of our Machine Learning in Business Analytics course at the Université de Lausanne - Faculty of Business and Economics (HEC).

2.2 Project Goal (Research Question)

The primary goal of this exercise was to develop predictive machine learning models that can accurately predict whether we will like or dislike a song, purely based on its audio characteristics and metadata. Therefore, our research question can best be defined as the following: “How can we utilize machine learning models and techniques to predict our personal preference for a song based on its audio characteristics and metadata, and what are the key features that determine whether a song will be liked or disliked by us?”

2.3 Description of the Data

The data used for this project was sourced directly from Spotify’s Web API. Each member of our team compiled two playlists—one with liked songs and the other with disliked songs. After a subjective review, we chose to use the playlists created by our teammate Camille, as the tracks in her playlists exhibited more distinct patterns compared to others within our group (i.e., the difference between her Liked and Disliked songs appeared more distinct). We then gathered a comprehensive collection of attributes for the tracks of each playlist into a single dataset. This dataset included both audio properties, such as Tempo, Energy, and Danceability, and metadata like Artist Popularity and Genre. Initially received in JSON format, the data was converted into CSV files to facilitate the processing. These foundational features were crucial for modeling and uncovering underlying preferences towards different musical elements.

2.4 Methodology

In this machine learning project, we utilized a comprehensive model-based machine learning approach to quantify and predict preferences for Spotify tracks. To achieve this, we deployed a collection of machine learning models, each capable of analyzing song features to determine the likelihood of a user liking or disliking a song. The primary models employed included Logistic Regression, Classification Trees, Support Vector Machines (SVM), and Random Forests. Each model was initially tested in its baseline form (i.e., no tuning of hyperparameters) to assess its basic performance in predicting user preferences based on predefined metrics (Balanced Accuracy, Kappa, Precision, Recall, and AUC). We believe this approach was best for two reasons: first, it provided us with a unique opportunity to implement as much of the MLBA class learnings as possible in our report; second, we knew that “no one model works best for all possible situations” (Brownlee, 2021). Then, by comparing the quality of these baseline models on a defined validation set, we identified the models that best captured the nuances of musical preference (i.e., which performed best prior to any tuning). Following this initial evaluation on our validation set, we refined our approach by selecting the two best-performing models for hyperparameter tuning. This tuning process was designed to optimize the models’ predictive power by adjusting various hyperparameters in an attempt to push their predictive quality beyond the baseline results. This methodical enhancement helped sharpen the accuracy and reliability of our predictive analytics, ensuring that the final models would generalize to unseen data as well as possible. This dual-phase approach—initial broad comparison followed by focused optimization—ensured that our final predictive models were both robust and finely tuned to the specifics of individual user preferences in music.
In addition, this offered our team the learning opportunity to code, implement and interpret more than two models in this project.
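Our baseline comparison was implemented with the caret package in R. Purely as an illustrative sketch (not our actual code), the same 5-fold cross-validated comparison of two of the baseline models can be expressed in Python with scikit-learn, here on a synthetic stand-in for our 400-track, 22-feature dataset:

```python
# Illustrative Python/scikit-learn analogue of the R/caret baseline comparison;
# the data is a synthetic stand-in, not our Spotify dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 400 x 22 dataset mimicking the shape of our track data
X, y = make_classification(n_samples=400, n_features=22, random_state=42)

# Hold out a test set; baselines are compared by cross-validation on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

baselines = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in baselines.items():
    # 5-fold cross-validated AUC, one of the metrics used in our evaluation
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

In our actual workflow the comparison additionally covered Classification Trees, SVM, a Naïve baseline, and the Balanced Accuracy, Kappa, Precision, and Recall metrics.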

See the figure below for a visual representation of the steps taken in our project’s methodology.

Our methodology explained - Inspired by (Basavaraju et al., 2019)

3 Data

This section outlines the specifics regarding the dataset utilized by our team to predict if we would Like or Dislike a song, focusing on the features, data format, instances, and any pre-processing steps undertaken.

3.1 Data Collection and Processing Using Spotify’s API

As outlined in Section 2.3, we created our dataset by extracting the audio features and metadata from the tracks contained in two playlists that we created, one containing “Liked” tracks and the other “Disliked” tracks. The figure below illustrates the process we used to obtain data from Spotify’s Web API. Our approach involved scripting in Python with the ‘spotipy’ library (Welcome to Spotipy! — Spotipy 2.0 Documentation, n.d.), which simplifies API interaction through higher-level function calls.

Spotify Web API Script/Call Explained

3.2 Implementation of Batching due to Data Volume and API Limitations

Given the richness and volume of the data we were obtaining from Spotify, we faced several challenges:

  1. Data Volume: Our Spotify playlists contained hundreds of songs. Fetching detailed data for each of the tracks, including audio features and metadata, significantly increased the volume of data requested and retrieved per API call.
  2. API Rate Limits: Spotify imposes rate limits on how many requests can be made to their API within a certain time frame. Exceeding these limits resulted in a ‘429 Too Many Requests’ error, causing our script to constantly time out and halt.

To manage the volume of data and adhere to the API constraints, and after much trial and error, we successfully implemented a batching process:

  • Track URI Collection: Initially, we collected all track URIs from the playlist using paginated results from ‘playlist_tracks()’. This method ensured we didn’t miss any tracks due to pagination limits.

  • Batch Processing for Audio Features: We fetched audio features in batches of 50 tracks per API call. This batching not only streamlined data retrieval but also minimized the risk of hitting rate limits, as fewer API requests were made in total (e.g., with batching, one API call per 50 tracks instead of one call per track).

The batching technique and error handling mechanisms were crucial in efficiently retrieving large datasets from Spotify without interruption. This approach not only optimized our data collection process but also provided a robust framework to handle API limitations gracefully, ensuring comprehensive data retrieval for our predictive modeling project.
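The two mechanisms can be sketched as follows. The helper names (`chunked`, `call_with_retry`) are ours for illustration and are not part of spotipy, and the exact backoff schedule is an assumption rather than something Spotify mandates:

```python
import time

def chunked(seq, size=50):
    """Yield consecutive batches of `size` items (we used 50 track URIs per call)."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def call_with_retry(fn, *args, max_tries=4, base_delay=1.0, is_rate_limited=None, **kwargs):
    """Call fn, retrying with exponential backoff while is_rate_limited(exc) is True
    (e.g. an exception carrying a '429 Too Many Requests' status)."""
    for attempt in range(max_tries):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            last_try = attempt == max_tries - 1
            if is_rate_limited is None or not is_rate_limited(exc) or last_try:
                raise
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
```

For example, audio features could be fetched with `call_with_retry(sp.audio_features, batch, is_rate_limited=lambda e: '429' in str(e))` for each `batch` in `chunked(track_uris)`.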

3.3 Data-set Format

The dataset received from the Spotify Web API was in JSON format. Since our scripting code was written in Python, our script assembled the parsed JSON into a pandas DataFrame and exported it as a CSV file, which was then imported into our R environment for further processing and analysis. This conversion facilitated the integration of the dataset with our analysis tools, allowing for more sophisticated data manipulation and visualization.

Code
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

client_id = 'YOUR_CLIENT_ID'          # Spotify app credentials (redacted; create your own at developer.spotify.com)
client_secret = 'YOUR_CLIENT_SECRET'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# URLs of the playlists for Camille
liked_playlist_link = 'https://open.spotify.com/playlist/3hHsoFEOSBttbLkd3YGyLU' # FINAL PLAYLIST
disliked_playlist_link = 'https://open.spotify.com/playlist/5kci9g0ZiOSSP3MqJS6C9I'

# URLs of the playlists for Alex 
#liked_playlist_link = 'https://open.spotify.com/playlist/5s4m4kmikUzGf4LvX1Lms7'
#disliked_playlist_link = 'https://open.spotify.com/playlist/6cq1EaOcKqxCC0wjOXpAv3'


liked_playlist_URI = liked_playlist_link.split('/')[-1].split('?')[0]
disliked_playlist_URI = disliked_playlist_link.split('/')[-1].split('?')[0]

def fetch_tracks(playlist_URI, like_label, sp):
    tracks_data = []
    results = sp.playlist_tracks(playlist_URI)
    track_items = results['items']
    while results['next']:
        results = sp.next(results)
        track_items.extend(results['items'])

    track_uris = [item['track']['uri'] for item in track_items]

    # Batch fetch audio features and audio analysis
    audio_features_list = []
    for i in range(0, len(track_uris), 50):
        audio_features_list.extend(sp.audio_features(track_uris[i:i+50]))

    for item, audio_features in zip(track_items, audio_features_list):
        track_details = item['track']
        artist_info = sp.artist(track_details['artists'][0]['uri'])
        album_info = sp.album(track_details['album']['uri'])
        audio_analysis = sp.audio_analysis(track_details['uri'])

        avg_segment_duration = sum(seg['duration'] for seg in audio_analysis['segments']) / len(audio_analysis['segments']) if audio_analysis['segments'] else 0
        avg_segment_loudness = sum(seg['loudness_max'] for seg in audio_analysis['segments']) / len(audio_analysis['segments']) if audio_analysis['segments'] else 0
        avg_segment_timbre = [sum(seg['timbre'][i] for seg in audio_analysis['segments']) / len(audio_analysis['segments']) for i in range(12)] if audio_analysis['segments'] else [0]*12
        avg_section_tempo = sum(section['tempo'] for section in audio_analysis['sections']) / len(audio_analysis['sections']) if audio_analysis['sections'] else 0

        track_data = {
            'Track URI': track_details['uri'],
            'Track Name': track_details['name'],
            'Artist URI': track_details['artists'][0]['uri'],
            'Artist Name': track_details['artists'][0]['name'],
            'Artist Popularity': artist_info['popularity'],
            'Artist Genres': ', '.join(artist_info['genres']),
            'Artist Followers': artist_info['followers']['total'],
            'Album Name': album_info['name'],
            'Album Popularity': album_info['popularity'],
            'Release Date': album_info['release_date'],
            'Track Popularity': track_details['popularity'],
            'Explicit': track_details['explicit'],
            'Danceability': audio_features['danceability'],
            'Energy': audio_features['energy'],
            'Key': audio_features['key'],
            'Loudness': audio_features['loudness'],
            'Mode': audio_features['mode'],
            'Speechiness': audio_features['speechiness'],
            'Acousticness': audio_features['acousticness'],
            'Instrumentalness': audio_features['instrumentalness'],
            'Liveness': audio_features['liveness'],
            'Valence': audio_features['valence'],
            'Tempo': audio_features['tempo'],
            'Duration_ms': audio_features['duration_ms'],
            'Time Signature': audio_features['time_signature'],
            'Average Segment Duration': avg_segment_duration,
            'Average Segment Loudness': avg_segment_loudness,
            'Average Segment Timbre': avg_segment_timbre,
            'Average Section Tempo': avg_section_tempo,
            'Liked': like_label
        }
        tracks_data.append(track_data)
    return tracks_data

liked_tracks = fetch_tracks(liked_playlist_URI, 1, sp)
disliked_tracks = fetch_tracks(disliked_playlist_URI, 0, sp)
all_tracks = liked_tracks + disliked_tracks

df = pd.DataFrame(all_tracks)
df.to_csv('df_alex.csv', index=False)
print(df.head())

3.4 Data-set Description and Features

The following table outlines the features of our dataset, each of which provided key insights into different aspects of each song. Our complete dataset (containing all tracks from the two playlists combined) contains a total of 400 songs, each categorized as either Liked or Disliked based on our teammate Camille’s music preferences. Specifically, the playlists included 200 tracks that were Liked and 200 tracks that were Disliked. Each song within the playlists was labeled (i.e., 1 for Liked and 0 for Disliked) according to these preferences, which served as the target variable for our predictive modeling. The table below shows the Feature name, its Type, an example from our dataset, as well as a comprehensive explanation of the variable. To fully understand these features, we have also integrated an HTML player just below which lets you sample a 15-second segment of the tracks in our playlists with the highest and the lowest values of each audio feature (e.g., a highly acoustic and a barely acoustic track). If you’re interested in learning more about each of Spotify’s Web API features, you can also check out their extremely well-documented developer website (Web API | Spotify for Developers, n.d.).

High Feature Audio

Low Feature Audio

Feature High Low
Danceability Low Down (Lil Baby - Low Down (Audio) - YouTube, n.d.) How to Disappear Completely (How to Disappear Completely, 2020)
Energy No Half Measures (Unique Leader Records, 2020) Strange Fruit (Strange Fruit, 1939)
Key i like the way you kiss me (Artemas - i like the Way You Kiss Me (Official Music Video) - YouTube, n.d.) Austin (Austin, 2020)
Loudness Not Afraid (Eminem - Not Afraid (Lyrics) - YouTube, n.d.) Strange Fruit (Strange Fruit, 1939)
Mode A Bar Song (Tipsy) (Shaboozey, 2024) Hell N Back (Bakar - Hell N Back (Official Video), n.d.)
Speechiness Hell N Back (Bakar - Hell N Back (Official Video), n.d.) I Just Called To Say I Love You (I Just Called to Say I Love You, 1984)
Acousticness Strange Fruit (Strange Fruit, 1939) Jesus Christ Pose - Remastered 2016 (Jesus Christ Pose, 1991)
Instrumentalness How to Disappear Completely (How to Disappear Completely, 2020) A Bar Song (Tipsy) (Shaboozey, 2024)
Liveness Doorman (Doorman, 2021) Fuel (Fuel, 1997)
Valence Big Yellow Taxi (Joni Mitchell - Big Yellow Taxi (Official Lyric Video) - YouTube, n.d.) Doorman (Doorman, 2021)
Tempo Hell N Back (Bakar - Hell N Back (Official Video), n.d.) Where Do Broken Hearts Go (Where Do Broken Hearts Go, 1988)
Time Signature J’fais mes affaires (Djadja & Dinaz - J’fais Mes Affaires [Clip Officiel] - YouTube, n.d.) Money - 2011 Remastered (Money, 1973)
Average Seg Duration Strange (Celeste - Strange (Official Video) - YouTube, n.d.) Rush (Rush, 2023)
AvgSegLoudness Not Afraid (Eminem - Not Afraid (Lyrics) - YouTube, n.d.) Strange Fruit (Strange Fruit, 1939)
AvgSegTempo Hell N Back (Bakar - Hell N Back (Official Video), n.d.) Where Do Broken Hearts Go (Where Do Broken Hearts Go, 1988)

3.4.1 Feature Selection & Reordering

We selectively retained features that contributed significantly to our analysis, removing those that were redundant or less impactful:

  • Song URLs: These are web addresses where each song can be accessed or streamed on Spotify. The URLs themselves didn’t offer predictive value regarding whether a song would be Liked or Disliked, since they contain no information about the song’s content or characteristics. They were primarily used by our API script when requesting the corresponding song’s audio features, so they were considered redundant and removed from the dataset.
  • Artist URLs: Similar to song URLs, these are the web addresses for the artist profiles on Spotify and didn’t offer predictive value.

  • Average Segment Timbre: Timbre describes the texture or color of a musical sound distinguishing between different types of sounds and instruments. This specific audio metric was too detailed for our high-level analysis as it contained multiple levels of information (i.e., a list rather than a single value).

Additionally, in order to facilitate model training and evaluation, the target variable which indicated whether a song was Liked or Disliked was moved to the first column of our dataset.
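As a minimal sketch (toy columns, not our full feature set), this reordering amounts to one line of pandas:

```python
import pandas as pd

# Toy frame standing in for our dataset; 'Liked' starts in the middle
df = pd.DataFrame({"Danceability": [0.52, 0.81],
                   "Liked": [1, 0],
                   "Energy": [0.33, 0.90]})

# Move the target variable to the front, keeping the other columns in order
df = df[["Liked"] + [c for c in df.columns if c != "Liked"]]
```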

3.4.2 Data Cleaning

Next, we examined the dataset for missing values and duplicates, as these could significantly affect the results and accuracy of our predictions. Prior to proceeding, we ensured the dataset was clean.

  • Missing Values: Leveraging the high-quality data from Spotify’s Web API, our dataset was generally complete. However, a meticulous search revealed that in 11 rows, the ‘AlbumReleaseDate’ feature included only the year, lacking a full date. This issue was addressed in the ‘Feature Engineering’ section (Section 3.4.3).

  • Removing Duplicates: Due to Spotify’s ability to list a song more than once in a playlist, or given the fact that a song can be found multiple times on Spotify (e.g., original and with Featuring artist, or normal vs acoustic) we took measures to eliminate any duplicates, whether found twice within the same playlist or across both playlists.
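A minimal pandas sketch of the de-duplication step follows; which columns identify a duplicate (here track and artist name) is our assumption for illustration:

```python
import pandas as pd

# Toy data: the same track/artist pair appears in both playlists
df = pd.DataFrame({"Track Name": ["Doorman", "Doorman", "Fuel"],
                   "Artist Name": ["slowthai", "slowthai", "Metallica"],
                   "Liked": [1, 0, 1]})

# Keep the first occurrence of each track/artist pair, drop the rest
df = df.drop_duplicates(subset=["Track Name", "Artist Name"]).reset_index(drop=True)
```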

3.4.3 Feature Engineering

To improve the predictive quality of our model and ensure the usability of our data for machine learning, we undertook several feature engineering steps in an attempt to increase the dataset’s quality and predictive ability. Below is a detailed breakdown of the features we engineered:

  • Age In Days: We created a new column quantifying the age of each track by transforming its release date into the song’s age in days. We believe this transformation provided more granularity, precision, and consistency to our model than a raw release date (White, 2017) (The Best Way to Encode Dates, Times, and Other Cyclical Features, n.d.).

  • Duration in Seconds: Originally, song durations were provided in milliseconds. We converted this measure into seconds to align with common perceptual timescales, making the data more intuitive and comparable across different analyses.

  • Average Segment Duration: We calculated the average duration of musical segments within each track to better understand the song’s structure and flow. Longer segments generally indicate a more drawn-out, possibly ambient style, while shorter segments suggest quicker, more dynamic changes within the track (e.g., Dance music has shorter segments, whereas Classical music will have longer segments) (Xiao et al., 2008)

  • Average Segment Loudness: We calculated the average loudness of segments to gain insights into the track’s dynamic range. The loudness level can affect a listener’s perception of a song’s energy and intensity, which are key elements in genres like electronic and rock music.

  • Average Section Tempo: This feature involved computing the average tempo across various sections of a track. Tempo variations within a song can significantly affect its emotional impact and energy level, impacting its likelihood of being liked or disliked. We believed that this would be an interesting feature when combined with features such as valence which measures the “Happiness” of a song for example.

  • Count of Music Genres: Given the extreme granularity of music genres on Spotify, we weren’t able to use Genre explicitly as a feature in our modelling; it would have been too complex and fine-grained (check out this interesting BBC article discussing how music genres have become so niche that they’re now becoming irrelevant). However, we still wanted to incorporate genres in some way, so we added a feature counting the number of genres associated with each track. This count provides a sense of the artist’s versatility and breadth of musical styles, which could correlate with broader appeal across diverse listener groups; it can also be understood as a proxy for the “nicheness” of a given song.

  • Length of the Song Name: We calculated the number of words in each song’s title. This feature might seem trivial but could reflect certain trends in music marketing and titling that may appeal differently to various demographics.
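The engineered columns above can be sketched in pandas as follows (toy rows; the year-only padding mirrors the missing-date fix described in Section 3.4.2, and the reference date is the report's publication date):

```python
import pandas as pd

df = pd.DataFrame({
    "Release Date": ["2020-01-01", "1999"],   # some rows carry year-only dates
    "Duration_ms": [180000, 240000],
    "Artist Genres": ["pop, dance pop", "rock"],
    "Track Name": ["A Bar Song (Tipsy)", "Fuel"],
})

# Pad year-only dates to January 1st of that year, then parse uniformly
full_dates = df["Release Date"].where(df["Release Date"].str.len() > 4,
                                      df["Release Date"] + "-01-01")
release = pd.to_datetime(full_dates)

df["AgeInDays"] = (pd.Timestamp("2024-05-19") - release).dt.days  # track age
df["DurationSec"] = df["Duration_ms"] / 1000                      # ms -> seconds
df["GenreCount"] = df["Artist Genres"].str.split(",").str.len()   # genre count
df["WordCount"] = df["Track Name"].str.split().str.len()          # title length
```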

3.4.4 Average Segment Duration

This feature represents the average duration of the musical segments within a track. A segment, in Spotify’s terms, usually represents a single musical event or note. The formula for calculating the average segment duration is:

\text{Average Segment Duration} = \frac{\sum (\text{Segment Duration})}{\text{Number of Segments}}

Where:

  • Segment Duration is the duration of each segment within the track.

  • Number of Segments is the total count of segments analyzed within the track.

3.4.5 Average Segment Loudness

This feature captures the average maximum loudness of the segments within a track. Loudness is a crucial feature in music analysis as it impacts the perceived energy and intensity of the track.

\text{Average Segment Loudness} = \frac{\sum (\text{Segment Max Loudness})}{\text{Number of Segments}}

Where:

  • Segment Max Loudness is the maximum loudness value recorded for each segment.

  • Number of Segments is the total count of segments analyzed within the track.

3.4.6 Average Section Tempo

The tempo of a song can vary between different sections. This average gives a general idea of the track’s overall tempo, considering its variability across different parts.

\text{Average Section Tempo} = \frac{\sum (\text{Section Tempo})}{\text{Number of Sections}}

Where:

  • Section Tempo is the tempo (beats per minute) of each section within the track.

  • Number of Sections is the total count of sections analyzed within the track.

These calculated averages provide deep insights into the structural and acoustic properties of each track. By integrating these features into our dataset, we enhanced the capability of our machine learning models to find patterns and preferences in our teammate’s music listening behavior, improving the accuracy of predicting whether a song will be Liked or Disliked.

4 Exploratory Data Analysis

Once the dataset was prepared, we started an exploratory data analysis to gain initial insights into our dataset. This involved examining the distribution of variables based on ‘Liked’ status and identifying any outliers. Additionally, we explored the underlying structure of our data by conducting Principal Component Analysis (PCA) to determine if simplification was possible as well as creating a correlation matrix.

4.1 Distribution and Comparison of Variables by Liked Status (Density/Histogram Plots)

Initially, we analyzed the distribution of the 22 variables in our dataset, calculating the mean of each variable across the two ‘Liked’ statuses. In other words, we wanted to understand which features showed the largest absolute difference in means between the two classes of our target variable. In the table below, we highlighted the top five variables that showed the greatest divergence in their means; these differences were statistically significant with p-values below 1%. The histogram/density plots for all of the features can be found in the Appendix at the end of this report.
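The mean-gap ranking described above can be sketched with a pandas groupby (toy values chosen for illustration, not our actual measurements):

```python
import pandas as pd

df = pd.DataFrame({
    "Liked":       [1,    1,    0,    0],
    "Valence":     [0.70, 0.50, 0.40, 0.40],
    "Speechiness": [0.04, 0.06, 0.12, 0.08],
})

# Mean of every feature per class, then absolute difference between classes
means = df.groupby("Liked").mean()
gap = (means.loc[1] - means.loc[0]).abs().sort_values(ascending=False)
print(gap)  # features with the largest Liked/Disliked mean gap come first
```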

Plot 1

Plot 2

Plot 3

Plot 4

Plot 5
  • Plot 1 illustrates the distribution of valence, which measures musical positiveness, for our two categories: Liked and Disliked songs. The red histogram and density plot represent disliked songs, while the blue ones represent liked songs. Liked songs peak around 0.7 valence, suggesting a preference for happier, more positive-sounding tracks. The mean valence for disliked songs is approximately 0.4, whereas for liked songs it is around 0.6, suggesting that higher valence is associated with a higher likelihood of a song being liked.

  • Plot 2 illustrates the distribution of Speechiness, which measures the presence of spoken words in a track. The mean Speechiness for disliked songs is approximately 0.1, whereas for liked songs it is around 0.05, suggesting that lower Speechiness is associated with a higher likelihood of a song being liked. This indicates a preference for music with fewer spoken words such as Dance music for example.

  • Plot 3 illustrates the distribution of Explicit content, a categorical variable indicating the presence of Explicit lyrics within a track. Given its categorical nature, the songs are distributed on either side of the histogram, with a larger proportion on the non-Explicit side (Explicit = 0). The mean Explicit value for liked songs is approximately 0.05, indicating a clear preference for songs without Explicit lyrics. This suggests that songs with non-explicit content are significantly more likely to be favored.

  • Plot 4 illustrates the distribution of GenreCount, which represents the number of genres associated with each song, for liked and disliked songs. A higher GenreCount might indicate that a song is less mainstream and more specific, such as “Deep Symphonic Black Metal”. For liked songs, the mean GenreCount is approximately 2.5, suggesting a preference for songs with fewer genres. This implies that songs associated with fewer genres, and with a potentially more mainstream appeal, are more likely to be liked.

  • Plot 5 illustrates the distribution of WordCount, representing the number of words in song titles, for liked and disliked songs. Both distributions show a high concentration at lower word counts, however, the mean WordCount for liked songs is approximately 2.5, indicating a preference for shorter titles, suggesting that songs with fewer words in their titles are more likely to be liked.

4.3 Principal Component Analysis (Unsupervised Section)

To gain insights into the underlying structure of our data, we conducted a Principal Component Analysis (PCA), which simplified the complexity of the dataset through the use of dimension reduction while preserving the most important information (Dey, 2023). The primary results are shown in the Screeplot below.

The scree plot revealed that capturing approximately 70-80% of the variance (a general rule of thumb for PCA (Elitsa Kaloyanova, n.d.)) requires at least 10 dimensions. This finding suggested that our dataset was inherently high-dimensional, containing complex structures and relationships that couldn’t be reduced to just a few dimensions without significant information loss. Furthermore, the need for 10 dimensions indicated that there was minimal redundancy among the features: many variables provided unique information and represented different audio insights essential for a comprehensive analysis. However, although using principal components instead of the actual features could have been advantageous for datasets with 50-100 variables, in our case, with only about 22 variables, it was not worth sacrificing the original features of our dataset.
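The scree computation can be sketched with scikit-learn (random stand-in data with our dataset's 400 × 22 shape, so the variance profile here will not match our real scree plot):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 22))             # stand-in for the 400-track feature matrix

X_std = StandardScaler().fit_transform(X)  # PCA is run on standardized features
pca = PCA().fit(X_std)

cum = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.searchsorted(cum, 0.75)) + 1  # components needed for ~75% variance
print(f"{n_dims} components capture {cum[n_dims - 1]:.0%} of the variance")
```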

The following brief PCA explanation is based on the wonderful PCA articles linked below:

Principal Component Analysis (PCA) is a powerful statistical method used in machine learning to simplify complex data while preserving its essential parts. It helps in making the data easier to explore and visualize by reducing the number of variables without losing critical information (variance). This makes PCA incredibly useful for boosting the efficiency of algorithms and enhancing data visualization.

Understanding PCA helps you find the most important patterns in the data and express them as a combination of new variables, called principal components. These components are sorted so the most important ones come first. They are derived by reorienting the axes of the data towards where it varies the most. This method starts by analyzing the data’s spread (covariance) and identifying the directions (eigenvectors) that capture the most variation. These directions become a new, simpler way to look at your data, making it easier to analyze and visualize complex datasets.
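The covariance/eigenvector steps described above can be made concrete on a small synthetic dataset (a sketch on toy data, not our Spotify features):

```r
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # toy data: 100 observations, 2 features
X[, 2] <- X[, 1] + 0.3 * X[, 2]     # induce correlation between the columns

Xc  <- scale(X, center = TRUE, scale = FALSE)  # 1. centre the data
S   <- cov(Xc)                                 # 2. covariance matrix
eig <- eigen(S)                                # 3. eigenvectors = principal directions
                                               #    (sorted by decreasing eigenvalue)
scores <- Xc %*% eig$vectors                   # 4. project data onto the new axes

# The eigenvector route agrees with prcomp() up to the sign of each axis:
all.equal(abs(scores), abs(prcomp(X)$x), check.attributes = FALSE)  # TRUE
```

The eigenvectors of the covariance matrix are exactly the principal directions that prcomp() returns (up to sign), which is why the final comparison holds.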

For a deeper dive into PCA, consider the following resources:

Roshmita Dey’s article on Medium, “Understanding Principal Component Analysis (PCA)”, provides a beginner-friendly overview that emphasizes the intuition and foundational math behind PCA. The guide from Towards Data Science, “Principal Component Analysis: Everything You Need to Know”, discusses practical applications and includes a step-by-step walkthrough of PCA. Aimonk’s piece, “Principal Component Analysis (PCA) in Machine Learning” on Medium, explores the relevance and application of PCA in machine learning.

In the figure below, the two axes, Dim1 and Dim2, represent the first two principal components, together accounting for 30% of the variance in our dataset, with Dim1 explaining 17.8% and Dim2 explaining 12.2%. The graph illustrates that variables pointing in the same direction, such as “Valence” and “Energy,” are positively correlated, while those pointing in opposite directions, like “Acousticness” and “Energy,” are negatively correlated. Variables at roughly 90 degrees to each other, such as “Valence” and “Acousticness,” are approximately uncorrelated. The importance of each variable in explaining the variance can be read from its distance to the center of the plot: variables located further from the origin have a greater influence on the overall variance of the dataset. This graph provided an initial understanding of how variables interrelate and their respective contributions to the variance within the dataset.

In the two figures above, we can see a detailed breakdown of how each variable contributed to Dimension 1 and Dimension 2.

In the graph above, we incorporated data showing how each song, categorized as either “Liked” or “Disliked,” aligned with the attributes associated with our two principal dimensions. According to the graph, “Liked” songs predominantly clustered on the negative side of Dim1, suggesting a higher association with variables such as “Energy” and “Valence,” and a lower association with “Explicitness”. These findings were in line with our observations in the multi-column boxplot as well as in the density plots.

During the interpretation of our model, it will be interesting to see if these features have a large impact on the predictability of our models.

4.4 Correlation Matrix

To build further upon our PCA analysis, we constructed a heatmap to visualize the correlations among our 22 features. The heatmap predominantly displayed white and light colors, which, consistent with our PCA findings, indicated that most features had little correlation and thus contributed unique information to our dataset (and therefore models).
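A heat map of this kind can be sketched with, for example, the corrplot package (the choice of plotting tool here is an assumption; df again denotes the data frame of our features):

```r
library(corrplot)  # assumed package; ggplot2's geom_tile would work equally well

num_feats <- df[sapply(df, is.numeric)]
corr_mat  <- cor(num_feats, use = "pairwise.complete.obs")

# White/light cells correspond to correlations near zero
corrplot(corr_mat, method = "color", type = "upper",
         tl.cex = 0.7, tl.col = "black")

# List the strongest absolute pairwise correlations programmatically
corr_pairs <- as.data.frame(as.table(corr_mat))
corr_pairs <- subset(corr_pairs, as.integer(Var1) < as.integer(Var2))
head(corr_pairs[order(-abs(corr_pairs$Freq)), ], 5)
```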

Nonetheless, we focused our attention on areas of the heat map that showed slightly more intense colors, indicating stronger positive or negative correlations between specific features. To enhance our understanding of the variables in the dataset and the underlying patterns in music, we chose to examine the five strongest correlations, which can be seen in the figures presented below.

Plot 1

Plot 2

Plot 3

Plot 4

Plot 5

Based on the regression lines and the distribution of data points, we derived a few key insights from the scatter plots above. (N.B.: the categorical variable Explicit does not take only the values 0 and 1 here because the data was scaled so that all variables could be shown on the same scatterplot.)

  1. Acousticness vs Energy:
  • Negative Correlation: Research confirms that Acousticness and Energy in music tracks typically show a negative correlation. This relationship is logical, as acoustic tracks, which often feature natural and softer sounds, tend to have lower energy levels. Conversely, tracks with higher energy usually incorporate amplified and more dynamic sounds, reducing the acoustic quality. This contrast is evident in various music analysis studies that explore the characteristics influencing a song’s mood and intensity (Peterson, 2021; Classifying Genres in R Using Spotify Data, 2019).
  2. Danceability vs AvgSegDuration:
  • Negative Correlation: Analysis of danceable music shows a trend where tracks with higher Danceability often feature shorter and more repetitive segments. This characteristic makes the music catchier and easier to dance to, which aligns with findings that shorter segments can enhance a song’s rhythmic appeal and accessibility to listeners. The quicker, repetitive beats are more likely to sustain a listener’s attention and promote physical movement, which is crucial for dance music.
  3. Speechiness vs Explicit:
  • Positive Correlation: A significant positive correlation between Speechiness and explicit content is observed, particularly because tracks with higher Speechiness often contain more lyrics and spoken words, which increases the likelihood of explicit content. This correlation is particularly strong in genres like hip-hop or rap, where verbal expression is a central element, and explicit language is more common. The analysis confirms that as Speechiness increases, so does the probability of a track containing explicit words.
  4. Energy vs Loudness:
  • Positive Correlation: It’s well established that there is a strong positive correlation between energy and loudness in music tracks. Higher energy in music typically results from greater intensity and activity within the track, which often translates to higher loudness levels. This is particularly prevalent in genres like rock, pop, and electronic, where dynamic and powerful sounds create a perception of high energy. The relationship hinges on the fact that our auditory perception equates louder sounds with higher energy, which is a key factor in music mastering where consistent loudness is crucial for the listener’s experience across different tracks and platforms (LUFS 101, 2023).
  5. Instrumentalness vs ArtistPopularity:
  • Negative Correlation: The inverse relationship between instrumentalness and artist popularity can be partly explained by the commercial music landscape’s favoring of vocal-centric tracks. Songs featuring vocals tend to appeal to wider audiences, enhancing an artist’s popularity potential. Instrumental music, while valuable and popular within specific niches, often lacks the broad appeal of vocal music due to its more specialized or genre-specific nature. This trend reflects a general preference in the listening public, where easily relatable and lyrically driven songs tend to dominate mainstream channels and platforms, influencing artist popularity metrics (Loudness - Everything You Need To Know | Production Expert, n.d.).
Summary of DataFrame
Variable Minimum 1st Quartile Median Mean 3rd Quartile Maximum
Liked 0.00 0.00 0.50 0.50 1.00 1.00
ArtistPopularity 12.00 60.00 72.00 67.63 77.00 100.00
ArtistFollowers 116.00 544,398.75 3,742,145.00 12,049,075.55 11,615,511.50 113,928,979.00
AlbumPopularity 0.00 45.75 59.00 55.13 68.25 100.00
TrackPopularity 0.00 48.00 65.00 60.86 76.00 100.00
Explicit 0.00 0.00 0.00 0.23 0.00 1.00
Danceability 0.17 0.55 0.66 0.65 0.76 0.96
Energy 0.04 0.56 0.72 0.69 0.84 0.99
Key 0.00 2.00 5.50 5.42 9.00 11.00
Loudness −22.63 −7.86 −5.97 −6.56 −4.60 −1.19
Mode 0.00 0.00 1.00 0.61 1.00 1.00
Speechiness 0.02 0.04 0.05 0.09 0.11 0.59
Acousticness 0.00 0.02 0.10 0.20 0.31 0.99
Instrumentalness 0.00 0.00 0.00 0.04 0.00 0.94
Liveness 0.03 0.09 0.12 0.19 0.23 0.97
Valence 0.04 0.32 0.58 0.54 0.74 0.97
Tempo 61.88 105.02 123.15 122.95 133.97 209.94
TimeSignature 1.00 4.00 4.00 3.95 4.00 5.00
AvgSegDuration 0.19 0.24 0.26 0.27 0.29 0.43
AvgSegLoudness −25.69 −9.25 −7.02 −7.57 −5.10 −1.17
AvgSegTempo 54.93 104.79 122.04 121.94 133.93 209.70
AgeInDays 15.00 244.50 1,737.00 4,468.51 7,069.50 24,609.00
Duration 85.36 170.62 197.71 212.44 243.74 484.53
WordCount 1.00 2.00 3.00 3.40 5.00 14.00
GenreCount 1.00 1.00 3.00 2.97 4.00 9.00

4.5 EDA Hypotheses

Based on the observations derived from our exploratory data analysis, we propose the following hypotheses for further investigation in our report:

  1. Valence: Higher Valence in a song increases the likelihood of it being liked, suggesting that songs projecting a positive mood are preferred.

  2. Speechiness and Explicit Content: Lower levels of Speechiness and Explicit content are associated with a higher likelihood of a song being liked, indicating a preference for non-explicit, melodic tracks.

  3. GenreCount: A lower number of Genres per song predicts higher likability, implying that straightforward, less complex musical genres are preferred.

  4. Segment Duration: Shorter segment durations within songs enhance likability, reflecting a preference for catchy and quickly engaging music.

5 Modelling (Supervised)

5.1 Data splitting

Before generating our predictive models, we organized our dataset into two main parts:

  1. Training Set: This portion was used to train our models, enabling them to learn from our data. It comprised 80% of our entire dataset and included both the input features and the corresponding target values.
  2. Test Set: This portion was utilized to evaluate the model’s performance after the training and hyperparameter tuning phase. It was essential for assessing how well the model could generalize to new, unseen data and made up the remaining 20% of our dataset.

Further, to refine our models and prevent overfitting and data leakage (Data Leakage and Its Effect on Machine Learning Models | by Swetha | Medium, n.d.), we subdivided our training set into two distinct subsets:

  1. Training Subset: This subset, comprising 80% of our initial training set, was used for ongoing model training.
  2. Validation Set: The remaining 20% of our initial training set served as the validation set. It was used to validate the performance of the baseline models as well as the best-performing hyperparameter-tuned models, ensuring the models’ effectiveness and preventing overfitting. This allowed us to directly compare the baseline and tuned models on the same validation set.

The figure below shows a visual explanation of the breakdown of our datasets into training, test and validation sub-sets.

Our Data Splits Visualized

In this process, we also employed stratified sampling for data splitting. This approach was chosen to ensure that the proportion of our target variables, Liked and Disliked, remained consistent with those of the full dataset (Igareta, 2021), which contained 400 observations equally divided between the two classes. Stratified sampling was crucial in maintaining balanced representation across our training and validation sets, thus mitigating the risk of bias and ensuring the robustness of our models.
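A minimal sketch of this stratified splitting with caret's createDataPartition(), which samples within each level of the outcome, might look as follows (df is the full dataset; df_tr mirrors the name used in our tuning code later on):

```r
library(caret)
set.seed(123)

# 80% training / 20% test, stratified on the outcome `Liked`
idx_train <- createDataPartition(df$Liked, p = 0.8, list = FALSE)
df_train  <- df[idx_train, ]
df_test   <- df[-idx_train, ]

# Split the training set again: 80% training subset / 20% validation set
idx_sub <- createDataPartition(df_train$Liked, p = 0.8, list = FALSE)
df_tr   <- df_train[idx_sub, ]
df_val  <- df_train[-idx_sub, ]

# Stratification keeps the class balance close to 50/50 in every split
prop.table(table(df_tr$Liked))
```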

5.2 Baseline Modelling

To determine the most effective model for predicting whether a song would be Liked or Disliked, we began by fitting a set of baseline models. We used a variety of models, namely Logistic Regression, SVM, Classification Trees and Random Forests, and assessed their initial baseline performance on specific metrics. The model that demonstrated the best performance from this initial group was then selected for further refinement through hyperparameter tuning.
All baseline models were run using the caret package and implemented 5-fold cross-validation (except for the Naïve model, which does not fit a statistical model but simply predicts the most probable category in the data). We chose 5-fold cross-validation for our dataset of 400 observations to ensure that each fold contained a sufficient number of observations for robust model training and validation. A larger number of folds would have resulted in smaller training and validation sets, which might not have adequately represented the data’s variability and could have led to overfitting or underfitting issues. By using 5 folds, we balanced the need for reliable model evaluation with the practical consideration of maintaining reasonably sized training and test sets (Does It Matter?, n.d.) (Data Science - K-Fold Cross-Validation How Many Folds? - Stack Overflow, n.d.) (3.1. Cross-Validation: Evaluating Estimator Performance — Scikit-Learn 1.4.2 Documentation, n.d.). The diagram below visualizes how cross-validation works by splitting the training set into 5 parts (5-fold), using four parts to train the model and the fifth as a validation set.

Cross Validation Visualized - Inspired by (Agrawal, 2020)

5.2.1 Naive Model

The first model we implemented was a Naïve Model, which predicted the most probable category based on our data (essentially taking a guess). Since our data splits were stratified, this model effectively made predictions with a 50/50 probability (open callout to see confusion matrix and scores). Serving as a completely random baseline, this Naïve Model helped us set a foundational benchmark for evaluating other models. Essentially, if the accuracy (and/or other relevant scores) of subsequent models did not surpass that of the Naïve Model, those models would be considered ineffective and not worth further implementation.

Metric Value
Accuracy 0.5
95% CI (0.372, 0.628)
No Information Rate 0.5
P-Value [Acc > NIR] 0.55
Kappa 0
Mcnemar's Test P-Value 4.2514e-08
Sensitivity 1
Specificity 0
Pos Pred Value 0.5
Neg Pred Value NaN
Prevalence 0.5
Detection Rate 0.5
Detection Prevalence 1
Balanced Accuracy 0.5
'Positive' Class Negative

How we calculated the Naive Model:

  1. Determine the Most Frequent Category:

    • Calculate the frequency of each category in the training set.
    • Identify the category with the highest frequency.

    Let: \text{Category\ Counts} = \{C_1: n_1, C_2: n_2, \ldots, C_k: n_k\} where ( C_i ) represents the ( i )-th category and ( n_i ) is the number of occurrences of ( C_i ) in the training set.

    The most frequent category ( C_{\text{max}} ) is: C_{\text{max}} = \arg\max_{C_i} (n_i)

  2. Predict the Most Frequent Category for the Validation Set:

    • For each observation in the validation set, assign the most frequent category ( C_{\text{max}} ).

    Let ( N_{\text{val}} ) be the number of observations in the validation set.

    The predictions ( \hat{Y}_{\text{naive}} ) for the validation set are: \hat{Y}_{\text{naive}} = \{C_{\text{max}}, C_{\text{max}}, \ldots, C_{\text{max}}\} \quad \text{(repeated \(N_{\text{val}}\) times)}

This simple approach provides a benchmark to assess the performance of more complex models. By comparing other models to this naive baseline, we can better understand their predictive power and improvements.
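The two steps above can be sketched in a few lines of R (assuming the training subset df_tr and validation set df_val from Section 5.1, with Liked stored as a factor):

```r
library(caret)

# 1. Determine the most frequent category in the training set
most_freq <- names(which.max(table(df_tr$Liked)))

# 2. Predict that category for every observation in the validation set
naive_pred <- factor(rep(most_freq, nrow(df_val)),
                     levels = levels(df_val$Liked))

# With stratified 50/50 data this yields ~50% accuracy by construction
confusionMatrix(naive_pred, df_val$Liked)
```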


5.2.2 Logistic Regression

The second model we coded was a Logistic Regression (LR) model. Given the binary classification task (predicting a 0 or 1), this is one of the most commonly used models for such cases (Choosing a Model for Binary Classification Problem | by Andrii Gozhulovskyi | Medium, n.d.). LR models estimate the probability of a binary outcome based on the features of the dataset, using a logistic function to transform a linear combination of these features into a probability (i.e., a value between 0 and 1). If the probability exceeds a given threshold (0.5 in our case), the model predicts a positive outcome, and vice versa. For a detailed explanation of how LR models work, you can check out Natassha Selvaraj’s wonderful “Logistic Regression Explained in 7 Minutes” article on Medium/Towards Data Science.

Baseline Model Setup:

  • Package Used: caret

  • Function: train()

  • Model Type: Generalized Linear Model (GLM) with a logistic link (binomial family)

  • Cross-Validation: 5-fold cross-validation

  • Predictive Performance Evaluation: Standard classification metrics (accuracy, sensitivity, specificity)

  • Probability Prediction: Enabled (class probabilities)
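A minimal sketch of this baseline setup with caret (assuming the df_tr and df_val splits from Section 5.1) could look like:

```r
library(caret)
set.seed(123)

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

# Baseline logistic regression via caret's GLM interface
glm_model <- train(Liked ~ ., data = df_tr,
                   method = "glm", family = "binomial",
                   trControl = ctrl)

# Evaluate on the validation set
glm_pred <- predict(glm_model, newdata = df_val)
confusionMatrix(glm_pred, df_val$Liked)
```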

As can be seen, the accuracy of this model was approximately 76.56%, a good improvement over the naive model. The confidence interval for the accuracy indicates that the true accuracy of the model, given the data, is expected to lie between 64.3% and 86.2%. The Kappa statistic of 0.53125 indicates moderate agreement beyond chance between the predicted and actual classifications. The model correctly identifies 81.25% of actual positives, demonstrating good sensitivity, and correctly rejects 71.88% of actual negatives, indicating reasonable specificity.

Metric Value
Accuracy 0.765625
95% CI (0.643, 0.862)
No Information Rate 0.5
P-Value [Acc > NIR] 0
Kappa 0.53125
Mcnemar's Test P-Value 0.60558
Sensitivity 0.8125
Specificity 0.71875
Pos Pred Value 0.743
Neg Pred Value 0.793
Prevalence 0.5
Detection Rate 0.40625
Detection Prevalence 0.546875
Balanced Accuracy 0.765625
'Positive' Class Positive


5.2.3 Classification Trees

The third model we used was a Classification Tree, which helped visualize how different features influenced the prediction of a song being “Liked” or “Disliked”. Classification trees are a non-parametric supervised learning method used for both classification and regression tasks. They work by splitting the data into subsets, creating a tree-like model of the decisions leading to a final classification. These models are easily interpretable and can be visualized (a white-box model). However, they need to be used with caution, as they can easily grow overly complicated trees (overfitting the training data) that don’t generalize well to unseen data (e.g., a test set) (1.10. Decision Trees, n.d.). To learn more about Classification Trees, you can check out Normalized Nerd’s video, which clearly explains in detail how they work.

Baseline Model Setup:

  • Package Used: caret for model training and rpart for the algorithm

  • Function: train() from caret and rpart.plot() for visualization

  • Model Type: Recursive Partitioning and Regression Trees (rpart)

  • Cross-Validation: 5-fold cross-validation to assess model performance

  • Parameter Tuning: Complexity parameter (cp) set to 0 to prevent automatic tuning and provide a baseline for comparison

(N.B: Given that rpart automatically tunes the hyperparameters of the model when using cross validation, we set the cp = 0 in order to get a baseline model that could be compared to the other models.)
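A sketch of this baseline, again assuming df_tr, might look as follows; note the fixed cp = 0 mentioned above:

```r
library(caret)
library(rpart.plot)
set.seed(123)

# cp = 0 suppresses rpart's complexity-parameter tuning,
# giving a fully grown tree as a comparable baseline
tree_model <- train(Liked ~ ., data = df_tr,
                    method = "rpart",
                    trControl = trainControl(method = "cv", number = 5),
                    tuneGrid = data.frame(cp = 0))

# Visualize the fitted tree (decision nodes and leaf probabilities)
rpart.plot(tree_model$finalModel)
```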

As for the results, the features “Explicitness”, “Valence”, “Duration”, “GenreCount”, “AgeInDays”, “Speechiness”, “ArtistFollowers” and “AlbumPopularity” were used as decision nodes in the tree. Each node split (e.g., Valence < 0.38) suggests that the feature significantly impacted the classification decision. Leaf nodes represent the probability of a song being classified as positive or negative based on the combination of features leading to that node. For instance, if a song’s “ArtistFollowers” were greater than 8.9 million and its “AlbumPopularity” was less than 33, it had a 44% probability of being liked.

However, looking at the table below, we can see that only about 64.06% of all predictions made by the model were correct, with a Confidence Interval suggesting that the true accuracy of the model could vary between 51.1% and 75.7%. In addition, the Kappa value of 0.28125 indicated modest predictive strength. This assessment was supported by the confusion matrix, which highlighted opportunities for improvement in both specificity and sensitivity, suggesting that the model could be more effective in reducing both types of prediction errors.

Metric Value
Accuracy 0.640625
95% CI (0.511, 0.757)
No Information Rate 0.5
P-Value [Acc > NIR] 0.02
Kappa 0.28125
Mcnemar's Test P-Value 0.40425
Sensitivity 0.71875
Specificity 0.5625
Pos Pred Value 0.622
Neg Pred Value 0.667
Prevalence 0.5
Detection Rate 0.359375
Detection Prevalence 0.578125
Balanced Accuracy 0.640625
'Positive' Class Positive


5.2.4 Support Vector Machine

We next set out to perform both linear and radial support vector machines; however, both methods ended up giving relatively similar results. Keeping Occam’s Razor in mind, we preferred the simplest route to avoid overcomplicating the model (Occam’s Razor in Machine Learning: Examples - Analytics Yogi, n.d.) (Hsu et al., n.d.). A linear Support Vector Machine (SVM) is a supervised learning algorithm used for binary classification that finds the optimal hyperplane separating two classes by maximizing the margin between them. The data points closest to the hyperplane, called support vectors, determine the position and orientation of the hyperplane. This approach is effective for linearly separable data and ensures robust classification with the largest possible margin. To learn more about this method, you can check out Lujing Chen’s extremely well written article “Support Vector Machine – Simply Explained” on Medium.

Baseline Model Setup:

  • Package Used: caret for model training.

  • Function: train() from caret.

  • Model Type: Support Vector Machine (SVM), specifically linear SVM for simplicity and effectiveness in binary classification tasks (RPubs - SVM with CARET, n.d.).

  • Cross-Validation: 5-fold cross-validation.

  • Feature Scaling: Applied as a pre-processing step (preProcess = “scale”). This is critical because SVM relies on maximizing the margin between classes, which is sensitive to the scale of the features.
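A sketch of the linear-SVM baseline under the same assumptions (df_tr and df_val; caret's svmLinear method relies on the kernlab package) could be:

```r
library(caret)   # method = "svmLinear" requires kernlab behind the scenes
set.seed(123)

# Linear SVM; scaling matters because the margin is scale-sensitive
svm_model <- train(Liked ~ ., data = df_tr,
                   method = "svmLinear",
                   preProcess = "scale",
                   trControl = trainControl(method = "cv", number = 5))

svm_pred <- predict(svm_model, newdata = df_val)
confusionMatrix(svm_pred, df_val$Liked)
```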

This model demonstrated a relatively strong performance, with approximately 75% of its predictions being accurate. The Kappa statistic of 0.5 further supported this assessment, indicating decent predictive strength. A notable aspect of this model was its superior ability to correctly predict negative cases, as evidenced by a specificity of 84.38%. However, its sensitivity, which measures the accuracy of predicting positive cases, was lower at only 65.63%. This disparity suggested that there is room for improvement in the model’s ability to identify positive cases more effectively.

Metric Value
Accuracy 0.75
95% CI (0.626, 0.85)
No Information Rate 0.5
P-Value [Acc > NIR] 0
Kappa 0.5
Mcnemar's Test P-Value 0.2113
Sensitivity 0.65625
Specificity 0.84375
Pos Pred Value 0.808
Neg Pred Value 0.711
Prevalence 0.5
Detection Rate 0.328125
Detection Prevalence 0.40625
Balanced Accuracy 0.75
'Positive' Class Negative


5.2.5 Random Forest (Ensemble Method)

A Random Forest model is an ensemble learning method. “An ensemble learner is made of several learners – so called base learners or sub-learners that are combined for the prediction” (MLBA - S24 - Ensemble Methods, n.d.). In other words, it operates by constructing multiple decision trees (see the Classification Trees section for more on trees) and then aggregating their outputs to produce a single final prediction. This technique is especially powerful as it combines the simplicity of a classification tree with the ability to correct for trees’ tendency to overfit the training set. The model also performs well on binary classification tasks, as it can handle the large number of features in our dataset and the complex interactions between them. However, the complexity introduced by the ensemble method (combining multiple trees) comes at the cost of turning the model into a black-box model, meaning it is no longer easily interpretable through visualizations, for example. The figure below shows the difference between our previous Classification Tree and a Random Forest (here the RF has only 3 trees for demonstration purposes; our baseline model already uses 100 trees). To learn more, please check out the “Machine Learning-Decision Trees and Random Forest Classifiers” article by Karan Kashyap on Medium.

Classification vs Random Forest

Baseline Model Setup:

  • Package Used: caret for model training.

  • Function: train() from caret.

  • Model Type: Random Forest (RF), an ensemble learning method that constructs multiple decision trees and aggregates their predictions to enhance model accuracy and control overfitting.

  • Cross-Validation: 5-fold cross-validation to validate the model’s effectiveness.

  • Number of Trees: Initially set to 100 to balance between computational efficiency and predictive accuracy.
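A sketch of the random forest baseline under the same assumptions (df_tr and df_val; caret's "rf" method uses the randomForest package, and ntree is passed through to it):

```r
library(caret)
set.seed(123)

# Baseline random forest with 100 trees
rf_model <- train(Liked ~ ., data = df_tr,
                  method = "rf",
                  ntree = 100,
                  trControl = trainControl(method = "cv", number = 5))

rf_pred <- predict(rf_model, newdata = df_val)
confusionMatrix(rf_pred, df_val$Liked)
```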

This model accurately predicted approximately 70.31% of all cases. A Kappa value of 0.40625 indicated a moderate agreement beyond chance between the predicted and actual classifications, reflecting fair predictive strength. While the model demonstrated a moderate ability to detect positive cases with a sensitivity of 62.5%, it excelled in identifying negative cases with a specificity of 78.13%, showcasing its robustness in correctly rejecting negative instances.

Metric Value
Accuracy 0.703125
95% CI (0.576, 0.811)
No Information Rate 0.5
P-Value [Acc > NIR] 0
Kappa 0.40625
Mcnemar's Test P-Value 0.3588
Sensitivity 0.625
Specificity 0.78125
Pos Pred Value 0.741
Neg Pred Value 0.676
Prevalence 0.5
Detection Rate 0.3125
Detection Prevalence 0.421875
Balanced Accuracy 0.703125
'Positive' Class Negative

5.2.6 Comparison/Results of Baseline Models

When determining the baseline performance of our models, it was important to define what constitutes a “successful” or “high performing” model. To accurately assess our models’ effectiveness, we chose specific metrics for evaluation. Given the balanced nature of our datasets, achieved through stratified splitting, and the equal importance we placed on both Positive and Negative predictions, we settled on the following scores to guide our evaluation of the models’ performance.

Score/Metric Description
Balanced Accuracy Given the balanced nature of our dataset, this score indicates the overall proportion of correct predictions (i.e., in this case given the stratification Balanced Accuracy = Accuracy)
ROC AUC Measures the model’s ability to distinguish between classes across all threshold levels, combining both the sensitivity and specificity.
Kappa Measures the agreement between predicted and actual classifications beyond chance, providing a key insight into the model’s predictive strength.
Precision Precision measures the proportion of positive predictions that are actually correct.
Recall (Sensitivity) Recall measures the proportion of actual positives correctly identified by the model.

The following table provides these key metrics for all of our models. For a comprehensive table including all scores for all our models (e.g., F1_Score, PosPredValue, etc) please refer to the Appendix.

Baseline Model Performance Comparison
Performance evaluated on Validation Set
Metric Naive Logistic Regression Decision Tree SVM Linear Random Forest
BalancedAccuracy 0.500 0.766 0.641 0.750 0.703
Kappa 0.000 0.531 0.281 0.500 0.406
Precision 0.500 0.743 0.622 0.808 0.741
Recall 1.000 0.812 0.719 0.656 0.625

Understanding Kappa

Cohen’s Kappa measures the agreement between a classification model’s predictions and the actual outcomes, adjusting for chance agreement.

Kappa Table Summary (Bobbitt, 2021)
Kappa Value Interpretation
0 No Agreement
0.01-0.20 Slight Agreement
0.21-0.40 Fair Agreement
0.41-0.60 Moderate Agreement
0.61-0.80 Substantial Agreement
0.81-0.99 Near Perfect Agreement
1.00 Perfect Agreement
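As a worked example, Cohen's Kappa can be recomputed by hand from the Logistic Regression confusion matrix reported in Section 5.2.2 (64 validation observations; the cell counts below are reconstructed from the sensitivity, specificity and detection rates shown there):

```r
# Reconstructed confusion-matrix cells for the baseline logistic regression
TP <- 26; FN <- 6; TN <- 23; FP <- 9
n  <- TP + FN + TN + FP                 # 64 validation observations

p_o <- (TP + TN) / n                    # observed agreement (= accuracy)
p_e <- ((TP + FP) * (TP + FN) +
        (TN + FN) * (TN + FP)) / n^2    # agreement expected by chance

kappa <- (p_o - p_e) / (1 - p_e)
kappa                                   # 0.53125, matching the reported value
```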

Naïve:

  • Model demonstrated the worst performance, as it simply predicted the most probable category. Its Balanced Accuracy and ROC AUC were 0.5, and its Kappa was 0, indicating no agreement beyond chance. Since it predicted only one class, its Recall was perfect (1.000) but its Precision was low (0.500), meaning half of its positive predictions were incorrect.

Classification Tree:

  • Model had the worst performance among all baseline models (except Naïve), with balanced accuracy and ROC AUC at 0.641. The Kappa value (0.281) suggested fair agreement. It showed decent recall (0.719) but lower precision (0.622), indicating that while it identified positive cases reasonably well, its positive predictions were less reliable.

SVM:

  • Model performed well, with balanced accuracy (0.750) and ROC AUC (0.781). The Kappa value (0.500) indicated moderate agreement. It balanced high precision (0.808) with moderate recall (0.656), making it effective in identifying both positive and negative instances, though slightly favoring precision.

Logistic Regression:

  • The Logistic Regression demonstrated the best overall performance amongst the models. It achieved the highest balanced accuracy of 76.6%, indicating it correctly identified both liked and disliked songs with a high degree of accuracy. Its ROC AUC of 0.791 showed a strong ability to distinguish between the two classes, performing well across different threshold levels. The Kappa statistic of 0.531 suggested moderate agreement between predicted and actual classifications beyond chance, highlighting its predictive strength. The baseline model also excelled in recall (0.812), meaning it was highly effective in identifying songs that would be liked. Precision (0.743) was also strong, indicating that when the model predicted a “Liked” song, it was often correct. This balance between high recall and good precision made Logistic Regression a robust choice for predicting song preferences, ensuring both high identification of positive cases and reliable positive predictions.

Random Forest:

  • Random Forest also performed exceptionally well, particularly in terms of ROC AUC, with the highest score of 0.835. This indicated that the model had an excellent ability to distinguish between liked and disliked songs across all thresholds. Its balanced accuracy was 70.3%, suggesting it effectively handled both classes. The Kappa value for Random Forest was 0.406, showing moderate agreement beyond chance. The model had high precision (0.741), meaning its positive predictions (liked songs) were highly reliable. However, its recall (0.625) was lower than Logistic Regression, indicating it missed more positive cases. Despite this, the high precision and overall discriminative power of Random Forest made it a valuable model for this classification task.

Given the following scores from our baseline models on our Validation set we chose to continue our exercise by selecting the Logistic Regression and the Random Forest. We believed that both models exhibited strong, balanced performance, making them ideal candidates for further optimization. Logistic Regression was particularly strong in identifying positive cases (liked songs) with high recall and good precision. Random Forest excelled in discriminative power and precision, ensuring that its positive predictions were highly reliable. These models’ balanced performance and strong predictive capabilities justify their selection for further hyperparameter tuning.

The full results of all the baseline models can be found in the Appendix.

5.3 Hyperparameter Tuning

After choosing the Logistic Regression and Random Forest, we moved onto tuning the hyperparameters of the models in an attempt to increase their predictive capabilities even further. In other words, hyperparameter tuning consists of attempting to identify a set of optimal parameters for a machine learning algorithm to increase its performance and render it more robust (Hyperparameter Tuning Overview | BigQuery, n.d.) (What Is Hyperparameter Tuning?, n.d.). Different models offer different hyperparameters that can be tuned.

5.3.1 Logistic Regression Tuning

For the Logistic Regression model, we applied Elastic Net regularization, which combines the Lasso (L1) and Ridge (L2) regularization techniques. Both help prevent overfitting by adding a penalty to the loss function during training: Lasso uses an L1 penalty that promotes sparsity among the coefficients, potentially reducing some to exactly zero, whereas Ridge employs an L2 penalty that shrinks the coefficients towards zero without eliminating them. Elastic Net combines the properties of both. To learn more about the Elastic Net, see Rohit Bhadauriya's article “Lasso, Ridge & Elastic Net Regression: A Complete Understanding” (2021).
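Concretely, for the binomial case glmnet minimizes a penalized negative log-likelihood; the standard glmnet objective is reproduced here for reference:

```latex
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N} \log L\!\left(y_i,\; \beta_0 + x_i^\top \beta\right)
\;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \;+\; \alpha\,\lVert \beta \rVert_1 \right]
```

Here α = 1 gives pure Lasso, α = 0 gives pure Ridge, and λ scales the overall penalty strength.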

We defined a grid over the two key hyperparameters, ‘alpha’ and ‘lambda’. The ‘alpha’ parameter controls the mix between Lasso and Ridge regularization; we explored it in increments of 0.2. We also tried smaller increments such as 0.001, but observed the same results at the cost of longer computation times. The ‘lambda’ parameter controls the strength of the regularization; we tested it on a logarithmic scale from 10^-4 to 10^0. As with the other models, we used 5-fold cross-validation to ensure robustness.

The optimal tuning parameters for the Logistic Regression were:

  • Alpha = 0 (i.e., pure Ridge regularization)

  • Lambda = 0.1

Code
set.seed(123)
# Define Train Control
trControlglmtuning <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = "none",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Define Grid for Elastic Net (Lasso and Ridge)
tuneGrid <- expand.grid(
  alpha = seq(0, 1, by = 0.2),  # 0 = Ridge, 1 = Lasso
  lambda = 10^seq(-4, 0, by = 1)
)

# Train Model with Elastic Net
set.seed(123)
glmnet_model <- train(
  Liked ~ .,
  data = df_tr,
  method = "glmnet",
  trControl = trControlglmtuning,
  tuneGrid = tuneGrid,
  metric = "ROC"
)

print("Best Parameters for Elastic Net Logistic Regression:")
print(glmnet_model$bestTune)

# Predict probabilities for the validation set
prob_val <- predict(glmnet_model, newdata=df_val, type="prob")

# Assuming the class probabilities for the positive class ("Positive") are under the "Positive" column
pred_val_liked <- ifelse(prob_val[, "Positive"] >= 0.5, "Positive", "Negative")

# Convert predictions and actuals to factor ensuring same levels
pred_val_liked <- factor(pred_val_liked, levels=c("Negative", "Positive"))
actuals_val_liked <- factor(df_val$Liked, levels=c("Negative", "Positive"))

confusion_matrix_glm_tuned <- confusionMatrix(pred_val_liked, actuals_val_liked, positive="Positive")

# Print the confusion matrix to see various metrics for the validation set
print(confusion_matrix_glm_tuned)

5.3.2 Random Forest Tuning

For the Random Forest model, we focused on three primary hyperparameters: ‘mtry’, ‘ntree’, and ‘nodesize’. The ‘mtry’ parameter, the number of features considered at each split, was tuned via a grid search over the values 2, 4, 6, 8, and 10. To find the optimal number of trees (‘ntree’) and the minimum size of the terminal nodes (‘nodesize’), we conducted a nested loop search with ‘ntree’ values of 100, 200, 300 and ‘nodesize’ values of 1, 5, 10. As with the Logistic Regression tuning, we tried other parameter values without observing significant changes beyond increased computation times. As with all other models in this study, we used 5-fold cross-validation.

The optimal tuning parameters for the Random Forest were:

  • mtry = 2

  • ntree = 100

  • nodesize = 5

Code
set.seed(123)
trControlRFTuning <- trainControl(
  method = "cv",           # Using standard k-fold cross-validation
  number = 5,             # Number of folds in the cross-validation
  savePredictions = "none",
  classProbs = TRUE,       # Save class probabilities for potential ROC analysis later
  summaryFunction = twoClassSummary  # Use summary statistics (ROC, Sensitivity, etc.)
)

# Grid Search for Tuning Random Forest (Only for `mtry`)
tuneGridRF <- expand.grid(
  mtry = c(2, 4, 6, 8, 10)  # Number of variables to try at each split
)

# Initialize Best Model and Metrics
best_model <- NULL
confusion_matrix_rf_tuned <- NULL
best_accuracy <- 0
best_params <- list(mtry = NA, ntree = NA, nodesize = NA)

# Nested Loop for ntree and nodesize
ntree_values <- c(100, 200, 300)
nodesize_values <- c(1, 5, 10)

for (ntree in ntree_values) {
  for (nodesize in nodesize_values) {
    set.seed(123)
    rf_model <- train(
      Liked ~ .,
      data = df_tr,
      method = "rf",
      trControl = trControlRFTuning,
      tuneGrid = tuneGridRF,
      ntree = ntree,               # Number of trees
      nodesize = nodesize,         # Minimum size of terminal nodes
      importance = TRUE            # Calculate variable importance
    )

    # Evaluate Performance on Validation Tuning Set (df_val)
    predictions_rf <- predict(rf_model, newdata = df_val)
    confusion_matrix_rf_tune <- confusionMatrix(predictions_rf, df_val$Liked, positive = "Positive")
    accuracy_rf <- confusion_matrix_rf_tune$overall["Accuracy"]

    # Check if it's the best model
    if (accuracy_rf > best_accuracy) {
      best_accuracy <- accuracy_rf
      best_model <- rf_model
      confusion_matrix_rf_tuned <- confusion_matrix_rf_tune
      best_params$mtry <- rf_model$bestTune$mtry
      best_params$ntree <- ntree
      best_params$nodesize <- nodesize
    }
  }
}

# Display Best Parameters and Confusion Matrix
print("Best Model Parameters for Random Forest:")
print(best_params)
print(confusion_matrix_rf_tuned)

5.3.3 Comparing Baseline vs Tuned Models

The table and ROC curves below compare the performance of the baseline Logistic Regression and Random Forest with that of the newly hyperparameter-tuned models.

Baseline vs Tuned Model Performance Comparison
Performance evaluated on the Validation Set

Metric              Baseline Logistic Regression   Tuned Logistic Regression   Baseline Random Forest   Tuned Random Forest
Balanced Accuracy   0.766                          0.781                       0.703                    0.766
Kappa               0.531                          0.562                       0.406                    0.531
Precision           0.743                          0.750                       0.741                    0.730
Recall              0.812                          0.844                       0.625                    0.844

Both the Logistic Regression and Random Forest models showed improvements in performance on the Validation Set. The Logistic Regression model, for instance, saw an increase in Balanced Accuracy from 0.766 to 0.781 and a rise in Kappa from 0.531 to 0.562, indicating better agreement beyond chance. Precision also improved slightly from 0.743 to 0.750, while Recall increased from 0.812 to 0.844, showing enhanced ability to correctly identify positive cases. The ROC AUC metric for Logistic Regression increased from 0.791 to 0.818, reflecting a better overall discrimination between classes.

Similarly, the Random Forest model demonstrated improvements, with Balanced Accuracy increasing from 0.703 to 0.766 and Kappa rising from 0.406 to 0.531. Although there was a slight decrease in Precision from 0.741 to 0.730, the Recall significantly improved from 0.625 to 0.844, indicating a much better performance in identifying positive cases. The ROC AUC for the Random Forest model also saw a modest increase from 0.835 to 0.839. These improvements suggested that the hyperparameter tuning process successfully enhanced the models’ ability to generalize and make accurate predictions, providing a more robust performance on the validation set.
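To make these metrics concrete, here is a self-contained base-R sketch computing Balanced Accuracy, Kappa, Precision, and Recall from a single 2×2 confusion matrix. The counts below are purely illustrative and are not our project's results.

```r
# Hypothetical 2x2 confusion matrix for the "Liked" class
# (rows = predicted, columns = actual); counts are illustrative only.
TP <- 30; FP <- 10
FN <- 10; TN <- 30

precision   <- TP / (TP + FP)          # how reliable "Liked" predictions are
recall      <- TP / (TP + FN)          # sensitivity: share of actual "Liked" found
specificity <- TN / (TN + FP)          # share of actual "Disliked" found
balanced_accuracy <- (recall + specificity) / 2

# Cohen's Kappa: agreement beyond what chance alone would produce
n  <- TP + FP + FN + TN
po <- (TP + TN) / n                                            # observed accuracy
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2    # chance agreement
kappa <- (po - pe) / (1 - pe)

round(c(precision = precision, recall = recall,
        balanced_accuracy = balanced_accuracy, kappa = kappa), 3)
# → precision 0.75, recall 0.75, balanced_accuracy 0.75, kappa 0.50
```

A Kappa of 0 would mean agreement no better than chance, so the mid-0.5 values in the table indicate moderate agreement beyond chance.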


5.4 Interpretation of the model(s)

Once we obtained the optimal tuning parameters, we retrained our models with these hyperparameters on the Complete Training Set, formed by merging the Training and Validation sets back together. We then tested the models on the final Test set and obtained the results shown in the table and ROC AUC curves below.

5.4.1 Model Results on Test Set

Validation vs. Test Model Performance Comparison
Performance comparison between Validation and Test Sets

Metric              Logistic Regression Validation   Logistic Regression Test   Random Forest Validation   Random Forest Test
Balanced Accuracy   0.781                            0.787                      0.766                      0.812
Kappa               0.562                            0.575                      0.531                      0.625
Precision           0.750                            0.767                      0.730                      0.778
Recall              0.844                            0.825                      0.844                      0.875

The Balanced Accuracy for Logistic Regression on the test set was 0.787, slightly higher than its validation performance of 0.781. Similarly, the Random Forest model showed a small improvement in Balanced Accuracy on the test set, rising to 0.812 from 0.766 on the validation set. It is noteworthy that the test set results were slightly better than the validation set results, which is unusual. However, we believe this is because the models were retrained on the combined training and validation datasets, increasing the number of observations by 25%. This augmentation of the training data likely enhanced the predictive capacity of the models, allowing them to perform better on the test set.

The Kappa statistic also showed a slight increase for both models on the test set. Logistic Regression’s Kappa improved from 0.562 to 0.575, and Random Forest’s Kappa increased from 0.531 to 0.625. These improvements suggest that the models had a stronger predictive strength when evaluated on the test set.

Precision and Recall metrics further supported the robustness of our models. Logistic Regression’s Precision increased from 0.750 to 0.767, indicating that a higher proportion of positive predictions were correct on the test set. Its Recall, however, slightly decreased from 0.844 to 0.825, a minor drop in identifying all actual positives. For the Random Forest model, both Precision and Recall improved on the test set, with Precision increasing from 0.730 to 0.778 and Recall rising from 0.844 to 0.875. This showed that on the test set the Random Forest model both identified more of the positive instances and made more reliable positive predictions.

5.4.2 ML Interpretation

5.4.2.1 Variable Importance Plots

This section aims to assess the significance of variables in our final models, tested on the Test Set, using the ‘DALEX’ package. We constructed explainer objects for both the Logistic Regression and the Random Forest tuned models. This setup enabled us to measure how the absence of each predictor affected the model’s predictive accuracy, specifically using the decrease in AUC as our metric of evaluation.
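The DALEX calls themselves are not reproduced here. As an illustration of the underlying idea, the following base-R sketch computes permutation importance as an AUC drop on synthetic data (not our Spotify dataset), with a plain glm standing in for our tuned models: shuffling an informative feature should hurt the AUC far more than shuffling noise.

```r
set.seed(123)
# Synthetic stand-in data (NOT our Spotify dataset): two informative
# features plus one pure-noise feature, and a binary "liked" outcome.
n <- 500
valence  <- runif(n)
explicit <- rbinom(n, 1, 0.4)
noise    <- rnorm(n)
liked <- rbinom(n, 1, plogis(3 * valence - 1.5 * explicit - 1))

# Plain glm stands in for our tuned models here
fit <- glm(liked ~ valence + explicit + noise, family = binomial)

# AUC via the rank (Mann-Whitney) formulation
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

base_auc <- auc(predict(fit, type = "response"), liked)

# Permutation importance: shuffle one feature and measure the AUC drop
perm_drop <- function(var) {
  d <- data.frame(valence, explicit, noise)
  d[[var]] <- sample(d[[var]])
  base_auc - auc(predict(fit, newdata = d, type = "response"), liked)
}

sapply(c("valence", "explicit", "noise"), perm_drop)
```

Running this, the shuffled "valence" column produces a large AUC drop while the noise column's drop is near zero, which is exactly the dropout-loss ranking DALEX reports for our real models.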

Logistic Regression Model:

  • Dominant Variables: The logistic regression analysis highlighted ‘Valence’, ‘Explicit’, ‘GenreCount’, and ‘WordCount’ as the most significant predictors. This aligned with the exploratory analysis, confirming the strong influence of these features on a song being Liked.

  • Valence: The consistent prominence of ‘Valence’ across different permutations underscored its importance in affecting listener preference, likely due to its direct impact on the emotional tone of music. (Arjmand et al., 2017)

  • Explicit Content: Songs labeled as ‘Explicit’ tended to be impactful, possibly reflecting listener sensitivity to lyrical content. This variable showed a significant dropout loss, indicating its strong effect in predicting dislikes or likes.

  • Genre and Word Count: Both ‘GenreCount’ and ‘WordCount’ demonstrated notable importance, suggesting that the diversity of genres and lyrical complexity were crucial in shaping listener preferences.

  • Instrumentalness: This variable came in as the 6th most important for predicting whether a song was liked or disliked. This is interesting given that the average Instrumentalness for both the “Liked” and “Disliked” classes in our dataset was relatively low.

Random Forest Model:

  • Dominant Variables: The Random Forest analysis highlighted ‘Valence’, ‘Explicit’, ‘WordCount’, and ‘GenreCount’ as the most significant predictors. This was aligned with those of the Logistic Regression and our findings from the EDA, confirming the strong influence of these features on a song being liked.

  • Feature Interaction Sensitivity: The Random Forest model exhibited a broader sensitivity to various features, highlighting its capacity to capture complex interactions between variables better than logistic regression.

  • Acousticness: In the Random Forest model, Acousticness was the fifth most important variable, whereas it had no impact in the Logistic Regression model. This distinction highlighted that the Random Forest was capable of capturing interactions with variables that the Logistic Regression could not.

  • Average Duration: ‘AvgSegDuration’ appeared as the eighth most important variable in the Random Forest model but was significantly less important in the Logistic Regression model. We believe this was due to its high correlation with other variables in our dataset, again underscoring the ability of the Random Forest to capture feature interactions that the Logistic Regression did not.

In summary, it is noteworthy that four of the top five variables (Valence, Explicit, GenreCount, and WordCount) aligned with those identified in our exploratory analysis and with our hypothesis. Additionally, Average Segment Duration emerged as a significant predictor of a song being liked, though only in the Random Forest. Valence and Explicit dominated in both the Random Forest and the Logistic Regression, underscoring their robust influence across different modeling approaches. The high importance of Valence is likely due to its direct impact on the listener's emotional and psychological responses, while explicit lyrics are often found in Rap, a genre heavily represented among “Disliked” songs. GenreCount and WordCount also showed notable importance. Average Segment Duration, which measures the length of song segments, showed some significance as well; this might be attributed to the tendency of pop songs, which were strongly favored in the “Liked” playlists, to feature shorter, catchier segments.

It is also important to note that Variable Importance plots do not capture interactions between variables. For example, Acousticness might appear unimportant for the Logistic Regression on this graph, yet its interaction with, say, Explicitness could drastically affect the predictive quality of the model. This is one of the major disadvantages of relying on Variable Importance alone.

5.4.2.2 Partial Dependence Plots

[PDP figures 1–7: partial dependence of the predicted “Liked” probability on Valence, Explicit, WordCount, GenreCount, Instrumentalness, Age in Days, and Average Segment Duration, for both models]

To further interpret our final two models, we produced Partial Dependence Plots (PDPs) using the pdp package in R. Unlike the Variable Importance Plots above, a PDP “gives the curve representing how much the variable affects to the final prediction at which value range of the variable” (DEI, 2019). In layman's terms, we are looking to understand the association between a feature (e.g., Valence) and the predicted probability for a song (e.g., Disliked) (MLBA - S24 - Interpretable ML, n.d.).
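As a minimal illustration of the mechanic behind a PDP, the following base-R sketch builds the partial dependence curve for a hypothetical “valence” feature by hand, on synthetic data (not our dataset) with a plain glm standing in for our tuned models: the feature of interest is fixed at each grid value while every other feature keeps its observed values, and the model's predictions are averaged.

```r
set.seed(123)
# Synthetic stand-in data (NOT our Spotify dataset); a plain glm
# stands in for our tuned models.
n <- 400
valence <- runif(n)
tempo   <- rnorm(n, 120, 20)
liked   <- rbinom(n, 1, plogis(4 * valence - 2))
fit <- glm(liked ~ valence + tempo, family = binomial)

# Partial dependence for 'valence' by hand: fix it at each grid value,
# keep every other feature as observed, and average the predictions
grid <- seq(0, 1, by = 0.1)
pdp_valence <- sapply(grid, function(v) {
  d <- data.frame(valence = v, tempo = tempo)
  mean(predict(fit, newdata = d, type = "response"))
})
round(data.frame(valence = grid, p_liked = pdp_valence), 3)
```

The resulting curve rises monotonically with valence, mirroring the kind of positive association described for Valence below; pdp::partial automates exactly this averaging for caret models.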

  • Valence

    • Logistic Regression: The likelihood of a song being predicted as Liked increased steadily with its Valence: the probability started at around 24% for songs with low Valence (near 0) and climbed to roughly 72.5% as Valence approached 1, an almost perfectly linear positive relationship between a song's Valence and its likelihood of being predicted as Liked.

    • Random Forest: The influence of Valence was also positive but less pronounced than in the Logistic Regression model: across a broad range of Valence levels, the probability that a song was Liked stayed between roughly 45% and 72.5%, suggesting that Valence had a smaller, though still positive, impact on the outcome in the Random Forest model.

  • Explicit Content (Categorical Variable)

    • Logistic Regression: For the Explicit variable, the predicted probability of Liked was about 50% when a song was not Explicit (= 0) and about 30% when it was Explicit (= 1). This was in line with our earlier finding that Explicit songs were less likely to be Liked.

    • Random Forest: The Explicit variable followed the same pattern as in the Logistic Regression; interestingly, though, the probability of predicting a non-Explicit song as Liked was about 5 percentage points higher than in the Logistic Regression, and when a song was Explicit the Random Forest was less inclined than the Logistic Regression to predict it as Disliked. This was most likely because the Random Forest captured interactions between variables, allowing for more nuanced predictions.

  • Word Count in Song Titles:

    • Logistic Regression: As the word count in a song's title increased, the likelihood of the song being predicted as Liked rose roughly linearly, from about 45% for a one-word title to about 70% for titles containing 10 words. This correlation between title length and likability was also noted in the Exploratory Data Analysis (EDA) section of our report.

    • Random Forest: The probability of predicting a song as Liked stayed relatively flat, ranging from about 59% at one word to about 70% at four words or more. This suggests that word count had a smaller impact on predicting a Liked song in this model than in the Logistic Regression.

  • Genre Count:

    • Logistic Regression: The probability of a song being Liked decreased as the genre count increased, dropping from roughly 59% for songs with a single genre to about 27.5% for songs with eight genres. This suggested that songs with more genre labels were less likely to be Liked.

    • Random Forest: The probability of a song being Liked was higher and more stable, ranging from roughly 67.5% down to 56% as the genre count increased from 1 to 8, indicating a less pronounced impact on likability than in the Logistic Regression model.

  • Instrumentalness:

    • Logistic Regression: Our results showed a decreasing trend in likability as Instrumentalness increased, starting at approximately 52.5% for non-instrumental tracks and dropping to around 25% for fully instrumental tracks. This suggested that songs with higher instrumental content were generally less liked.

    • Random Forest: The change was modest, with the likelihood of a song being Liked declining only from about 70% for non-instrumental songs to about 57% at full Instrumentalness, indicating that Instrumentalness had a comparatively small impact on likability in this model.

  • Age in Days:

    • Logistic Regression: The likelihood of a song being predicted Liked generally increased with its age, from about 45% for the newest songs to a maximum of roughly 66% for the oldest, indicating that older songs were more appealing to Camille's taste.

    • Random Forest: The pattern was less linear, starting at around 63%, remaining stable until roughly 12,500 days, then rising sharply to a probability of about 77.5% of predicting a Liked song.

  • Average Segment Duration:

    • Logistic Regression: The probability of predicting a Liked song increased roughly linearly with average segment duration, from about 45% at a duration of 0.2 to about 57% at a duration of 0.4.

    • Random Forest: The pattern was less straightforward: the probability of predicting a Liked song started at about 46% at a duration of 0.2, jumped to roughly 67.5% between 0.2 and 0.25, plateaued between 0.25 and 0.35, then fell back to approximately 57%.

The PDPs were intriguing and highlighted how differently our two models, the Logistic Regression and the Random Forest, processed features when predicting whether a song would be Liked. The Logistic Regression demonstrated a broader sensitivity to changes in feature values, while the Random Forest generally showed more stable probability outcomes across different levels of a feature, as seen with Valence. Our hypothesis was that Random Forests handle non-linear relationships and interactions better than Logistic Regression (Hershy, 2020).

6 Conclusion

In conclusion, our project, “Skip or Replay? Predicting Song Preferences on Spotify Using Machine Learning,” successfully leveraged the rich dataset provided by Spotify’s API to construct predictive models that discern listener preferences effectively. Using Logistic Regression and Random Forest models, we uncovered key insights into how different audio features influenced song likability. The models demonstrated a significant reliance on features such as Valence, Explicitness, GenreCount, and WordCount, which were found to be pivotal in determining Camille’s preferences.

Our analysis, supported by tools like the DALEX package and Partial Dependence Plots, provided a nuanced understanding of feature influence. We observed that Logistic Regression was particularly sensitive to changes in features like Explicitness and Valence, showing a clear trend where higher Valence and lower Explicitness increased a song’s likability. On the other hand, the Random Forest model exhibited robustness, maintaining consistent predictions across varying levels of these features, which suggests its superiority in handling non-linear interactions and complex feature relationships.

Furthermore, the exploration of temporal features such as song age showed that older songs generally gained higher likability scores, indicating a potential nostalgic influence on listener preferences. These insights not only enhance our understanding of the complexities of musical taste but also underscore the capability of machine learning to tackle intricate real-world problems. This project has not only provided a practical application of advanced machine learning techniques but also highlighted the critical role of data science in extracting meaningful insights from large-scale data, reinforcing its value in business analytics and decision-making processes today.

6.1 Limitations

Some limitations of this machine learning project include the size and composition of the dataset. We utilized a dataset comprising 200 liked and 200 disliked songs, a scope that, while functional, might be insufficient for fully capturing the breadth of factors that influence musical preferences. This limitation is particularly evident in the absence of certain musical genres, such as classical music, from the dataset. Consequently, our models were not trained on these genres, potentially compromising the generalizability of our findings across diverse musical styles. Future work could include building a genre framework that covers as many musical styles as possible, enabling a more diverse dataset and more generalizable models.

Moreover, it’s worth noting that while identifying liked songs tends to be straightforward, discovering disliked songs poses a more significant challenge, as these are generally not listened to. This difference can introduce issues in how song genres are represented within disliked playlists, potentially leading to biases in the data. For instance, achieving a balanced representation of both liked and disliked songs from the same genres could have enhanced the robustness of our model by providing a more equitable basis for comparison and analysis. Ensuring such balance is crucial for reducing bias and improving the accuracy of predictions about disliked songs.

Predicting song preferences is also inherently challenging due to the subjective nature of musical taste, which can vary significantly based on situational contexts such as the listener’s mood or environment. These factors underscore the difficulties in achieving precise predictions and highlight the need for cautious interpretation of the results when applying the developed predictive models beyond the specific dataset used in this study.

Finally, the models were specifically designed and tuned to predict the musical preferences of a single individual, Camille from our team. This customization involved selecting and tuning hyperparameters to optimize performance based on her unique taste profile (i.e., datasets), which featured clearly distinguishable preferences across certain musical genres. While this approach yielded favorable results within the scope of our project, it poses significant limitations for the external generalization of our models and findings. The models were tailored to the nuances of Camille’s musical preferences and may not perform effectively for another listener with different tastes or a more eclectic genre preference. For instance, if another user has a less defined or more varied musical taste than Camille, the predictive accuracy of our models could diminish significantly. This limitation highlights the challenge in creating universally applicable models in music preference prediction, underscoring the importance of considering individual differences when applying these machine learning techniques in broader, more diverse contexts.

6.2 References

  • 1.10. Decision Trees. (n.d.). In Scikit-learn. Retrieved 19 May 2024, from https://scikit-learn.org/stable/modules/tree.html

  • 3.1. Cross-validation: Evaluating estimator performance—Scikit-learn 1.4.2 documentation. (n.d.). Retrieved 17 May 2024, from https://scikit-learn.org/stable/modules/cross_validation.html

  • A Nice Beheading for Mom—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=DuAT2kW4eQo&ab_channel=MonumentofMisanthropy-Topic

  • Artemas—I like the way you kiss me (official music video)—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/

  • Bakar—Hell N Back (Official Video)—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=BdrNvQ4YCng&ab_channel=BakarVEVO

  • Billie Holiday—“Strange Fruit” Live 1959 [Reelin’ In The Years Archives]—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=-DGY9HvChXk&ab_channel=ReelinInTheYears66

  • Djadja & Dinaz—J’fais mes affaires [Clip Officiel]—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=v2o3in-Aud0&ab_channel=Djadja%26Dinaz

  • Eminem—Not Afraid (Lyrics)—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=anUz77ElBK4&ab_channel=RapCity

  • Joni Mitchell—Big Yellow Taxi (Official Lyric Video)—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=2595abcvh2M&ab_channel=JoniMitchell

  • Achy Breaky Heart—Wikipedia. (n.d.). Retrieved 8 May 2024, from https://en.wikipedia.org/wiki/Achy_Breaky_Heart

  • Agrawal, S. (2020). What is Hyperparameter Tuning (Cross-Validation and Holdout Validation) and Model Selection. In Medium. https://medium.com/@sanidhyaagrawal08/what-is-hyperparameter-tuning-cross-validation-and-holdout-validation-and-model-selection-a818d225998d

  • ajaymehta. (2023). Detecting and Preventing Data Leakage in Machine Learning: In Medium. https://medium.com/@dancerworld60/detecting-and-preventing-data-leakage-in-machine-learning-4bb910900ab7

  • Arjmand, H.-A., Hohagen, J., Paton, B., & Rickard, N. S. (2017). Emotional Responses to Music: Shifts in Frontal Brain Asymmetry Mark Periods of Musical Change. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.02044

  • Bakar—Hell N Back (Lyrics) ft. Summer Walker—YouTube. (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=ArOX0wJP3Y8&ab_channel=VibeMusic

  • Bakar—Hell N Back (Official Video). (n.d.). Retrieved 13 May 2024, from https://www.youtube.com/watch?v=BdrNvQ4YCng

  • Bayesian Hyperparameter Optimization: Basics & Quick Tutorial. (n.d.). Retrieved 12 May 2024, from https://www.run.ai/guides/hyperparameter-tuning/bayesian-hyperparameter-optimization

  • Basavaraju, A., Du, J., Zhou, F., & Ji, J. (2019). A Machine Learning Approach to Road Surface Anomaly Assessment Using Smartphone Sensors. IEEE Sensors Journal, PP, 1–1. https://doi.org/10.1109/JSEN.2019.2952857

  • BHADAURIYA, R. (2021). Lasso ,Ridge & Elastic Net Regression: A Complete Understanding (2021). In Medium. https://medium.com/@creatrohit9/lasso-ridge-elastic-net-regression-a-complete-understanding-2021-b335d9e8ca3

  • Bobbitt, Z. (2021). Cohen’s Kappa Statistic: Definition & Example. In Statology. https://www.statology.org/cohens-kappa-statistic/

  • Brownlee, J. (2021). No Free Lunch Theorem for Machine Learning. In MachineLearningMastery.com. https://machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/

  • cck3. (2022). Best practice for encoding datetime in machine learning. In Cross Validated. https://stats.stackexchange.com/q/311494

  • Chen, L. (2019). Support Vector Machine—Simply Explained. In Medium. https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496

  • Choosing a Model for Binary Classification Problem by Andrii Gozhulovskyi Medium. (n.d.). Retrieved 17 May 2024, from https://medium.com/@andrii.gozhulovskyi/choosing-a-model-for-binary-classification-problem-f211f7a4e263

  • Classifying genres in R using Spotify data. (2019). In Kaylin Pavlik. https://www.kaylinpavlik.com/classifying-songs-genres/

  • Cowboy Casanova. (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Cowboy_Casanova&oldid=1216997399

  • Dalpiaz, D. (n.d.). Chapter 10 Logistic Regression R for Statistical Learning. Retrieved 19 May 2024, from http://daviddalpiaz.github.io/r4sl/

  • Data Leakage and its effect on Machine Learning models by Swetha Medium. (n.d.). Retrieved 19 May 2024, from https://medium.com/@swethac42/data-leakage-and-its-effect-on-machine-learning-models-67c8edc588d4

  • Data science—K-Fold Cross-Validation How Many Folds? - Stack Overflow. (n.d.). Retrieved 17 May 2024, from https://stackoverflow.com/questions/51455186/k-fold-cross-validation-how-many-folds

  • DEI, M. (2019). Three Model Explanability Methods Every Data Scientist Should Know. In Medium. https://towardsdatascience.com/three-model-explanability-methods-every-data-scientist-should-know-c332bdfd8df

  • Dey, R. (2023). Understanding Principal Component Analysis (PCA). In Medium. https://medium.com/@roshmitadey/understanding-principal-component-analysis-pca-d4bb40e12d33

  • DiFrancesco, V. (2021). A Guide to Picking the Appropriate Scoring Metric for Your Machine Learning Classifier. In The Startup. https://medium.com/swlh/a-guide-to-picking-the-appropriate-scoring-metric-for-your-machine-learning-classifier-8e7bce9c6ae8

  • Does it matter? (n.d.). Retrieved 17 May 2024, from https://cran.r-project.org/web/packages/cvms/vignettes/picking_the_number_of_folds_for_cross-validation.html

  • Eby, M. (2019). Hyperparameters. In Analytics Vidhya. https://medium.com/analytics-vidhya/hyperparameters-80cb4f442e5

  • Elitsa Kaloyanova. (n.d.). What Is Principal Components Analysis? Data Science. Retrieved 17 May 2024, from https://365datascience.com/tutorials/python-tutorials/principal-components-analysis/

  • ES, S. (2022). 7 Cross-Validation Mistakes That Can Cost You a Lot [Best Practices in ML]. In Neptune.ai. https://neptune.ai/blog/cross-validation-mistakes

  • Faculty of Business and Economics (HEC Lausanne). (n.d.). Retrieved 6 May 2024, from https://www.unil.ch/hec/en/home.html

  • Fast Eddie—Yo Yo Get Funky. (1988). https://www.discogs.com/release/2750-Fast-Eddie-Yo-Yo-Get-Funky

  • GitHub—Spotipy-dev/spotipy: A light weight Python library for the Spotify Web API. (n.d.). Retrieved 6 May 2024, from https://github.com/spotipy-dev/spotipy

  • Grace (Jeff Buckley album)—Wikipedia. (n.d.). Retrieved 8 May 2024, from https://en.wikipedia.org/wiki/Grace_(Jeff_Buckley_album)

  • GridTest. (n.d.). In Gridtest. Retrieved 8 May 2024, from https://vsoch.github.io/gridtest/gridtest/

  • Gupta, M. (2024). Understanding Partial Dependence Plots (PDPs). In Data Science in your pocket. https://medium.com/data-science-in-your-pocket/understanding-partial-dependence-plots-pdps-415346b7e7f1

  • Gusarova, M. (2023). Logistic Regression Model Tuning (Python Code). In Medium. https://medium.com/@data.science.enthusiast/logistic-regression-tune-hyperparameters-python-code-fintech-does-it-bring-any-value-619e172565e6

  • Halo (Beyoncé song). (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Halo_(Beyonc%C3%A9_song)&oldid=1222787762

  • herhuf. (2017). ROC curve for different hyperparameters of RandomForestClassifier? In Data Science Stack Exchange. https://datascience.stackexchange.com/q/23637

  • Hong, Z. (2023). Evaluating Model Performance: A Comprehensive Guide. In Medium. https://medium.com/@zhonghong9998/evaluating-model-performance-a-comprehensive-guide-6f5e7c11409f

  • How to create New Features using Clustering‼ by Gowtham Dongari Towards Data Science. (n.d.). Retrieved 8 May 2024, from https://towardsdatascience.com/how-to-create-new-features-using-clustering-4ae772387290

  • Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (n.d.). A Practical Guide to Support Vector Classification.

  • Hyperparameter tuning. (2019). In GeeksforGeeks. https://www.geeksforgeeks.org/hyperparameter-tuning/

  • Hyperparameter tuning overview BigQuery. (n.d.). In Google Cloud. Retrieved 18 May 2024, from https://cloud.google.com/bigquery/docs/hp-tuning-overview

  • Igareta, A. (2021). Stratified Sampling: You May Have Been Splitting Your Dataset All Wrong. In Medium. https://towardsdatascience.com/stratified-sampling-you-may-have-been-splitting-your-dataset-all-wrong-8cfdd0d32502

  • Kashyap, K. (2020). Machine Learning- Decision Trees and Random Forest Classifiers. In Analytics Vidhya. https://medium.com/analytics-vidhya/machine-learning-decision-trees-and-random-forest-classifiers-81422887a544

  • kevin_theinfinityfund. (2021). Answer to ‘Best practice for encoding datetime in machine learning’. In Cross Validated. https://stats.stackexchange.com/a/550016

  • Logistic Regression Model Tuning with scikit-learn—Part 1 by Finn Qiao Towards Data Science. (n.d.). Retrieved 10 May 2024, from https://towardsdatascience.com/logistic-regression-model-tuning-with-scikit-learn-part-1-425142e01af5

  • Loudness—Everything You Need To Know Production Expert. (n.d.). Retrieved 12 May 2024, from https://www.production-expert.com/production-expert-1/loudness-everything-you-need-to-know

  • LUFS 101: What Are They & Why Are They So Important? (2023). https://unison.audio/lufs/

  • Machine learning—Does the optimal number of trees in a random forest depend on the number of predictors? - Cross Validated. (n.d.). Retrieved 12 May 2024, from https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred

  • Making Predictive Models Robust: Holdout vs Cross-Validation. (n.d.). In KDnuggets. Retrieved 10 May 2024, from https://www.kdnuggets.com/making-predictive-model-robust-holdout-vs-cross-validation

  • MLBA - S24—Ensemble Methods. (n.d.). In MLBA - S24. Retrieved 18 May 2024, from https://do-unil.github.io/mlba/lectures/06_Ensembles/ML_Ensemble.html

  • MLBA - S24—Interpretable ML. (n.d.). In MLBA - S24. Retrieved 18 May 2024, from https://do-unil.github.io/mlba/lectures/07_InterpretableML/ML_Interp.html

  • Molnar, C. (n.d.). 5.2 Logistic Regression Interpretable Machine Learning. Retrieved 12 May 2024, from https://christophm.github.io/interpretable-ml-book/logistic.html

  • Monstercat Uncaged. (2024). Ace Aura—Doorman [Monstercat Release]. https://www.youtube.com/watch?v=CMZlqpFi0G4

  • Nijkamp, R. (n.d.). Prediction of product success: Explaining song popularity by audio features from Spotify data.

  • Normalized Nerd. (2021). Decision Tree Classification Clearly Explained! https://www.youtube.com/watch?v=ZVR2Way4nwQ

  • Occam’s Razor in Machine Learning: Examples—Analytics Yogi. (n.d.). Retrieved 12 May 2024, from https://vitalflux.com/occams-razor-in-machine-learning-examples/

  • Oshiro, T., Perez, P., & Baranauskas, J. (2012). How Many Trees in a Random Forest? 7376. https://doi.org/10.1007/978-3-642-31537-4_13

  • Pandian, S. (2022). A Comprehensive Guide on Hyperparameter Tuning and its Techniques. In Analytics Vidhya. https://www.analyticsvidhya.com/blog/2022/02/a-comprehensive-guide-on-hyperparameter-tuning-and-its-techniques/

  • Peterson, H. (2021). Halpeter/Danceability-Analysis. https://github.com/halpeter/Danceability-Analysis

  • Principal Component Analysis (PCA) in R Tutorial DataCamp. (n.d.). Retrieved 18 May 2024, from https://www.datacamp.com/tutorial/pca-analysis-r

  • R: Control parameters for train. (n.d.). Retrieved 10 May 2024, from https://search.r-project.org/CRAN/refmans/caret/html/trainControl.html

  • Radiohead. (2016). How to Disappear Completely. https://www.youtube.com/watch?v=6W6HhdqA95w

  • RPubs—SVM with CARET. (n.d.). Retrieved 18 May 2024, from https://rpubs.com/uky994/593668

  • Selvaraj, N. (2022). Logistic Regression Explained in 7 Minutes. In Towards Data Science. https://towardsdatascience.com/logistic-regression-explained-in-7-minutes-f648bf44d53e

  • Shaboozey. (2024). A Bar Song (Tipsy). https://www.youtube.com/watch?v=lUj1Wjs7Hdg

  • Spotify - Web Player: Music for everyone. (n.d.). In Spotify. Retrieved 6 May 2024, from https://open.spotify.com/

  • Spotify User Stats (Updated March 2024). (n.d.). Retrieved 6 May 2024, from https://backlinko.com/spotify-users

  • Spotify Wrapped 2023: ‘Music genres are now irrelevant to fans’. (2023). https://www.bbc.com/news/entertainment-arts-67111517

  • The best way to encode dates, times, and other cyclical features. (n.d.). Retrieved 17 May 2024, from https://harrisonpim.com/blog/the-best-way-to-encode-dates-times-and-other-cyclical-features

  • Train Test Validation Split: How To & Best Practices [2023]. (n.d.). Retrieved 12 May 2024, from https://www.v7labs.com/blog/train-validation-test-set

  • Unique Leader Records. (2020). INGESTED - ‘No Half Measures’ (Official Lyric Video). https://www.youtube.com/watch?v=BZDOkVg8UF8

  • Variable Importance Plots—An Introduction to the vip Package. (n.d.). Retrieved 19 May 2024, from https://cran.r-project.org/web/packages/vip/vignettes/vip.html

  • Web API Spotify for Developers. (n.d.). Retrieved 6 May 2024, from https://developer.spotify.com/documentation/web-api

  • Welcome to Spotipy! —Spotipy 2.0 documentation. (n.d.). Retrieved 6 May 2024, from https://spotipy.readthedocs.io/en/2.22.1/

  • What do the audio features mean? - Spot On Track Help center. (n.d.). Retrieved 11 May 2024, from https://help.spotontrack.com/article/what-do-the-audio-features-mean

  • What is Hyperparameter Tuning? Domino Data Lab. (n.d.). Retrieved 18 May 2024, from https://domino.ai/data-science-dictionary/hyperparameter-tuning

  • What’s new in DALEX and DALEXtra. The DALEX package version 2.0 was… by Szymon Maksymiuk ResponsibleML Medium. (n.d.). Retrieved 12 May 2024, from https://medium.com/responsibleml/whats-new-in-dalex-and-dalextra-a75e5cebff0e

  • When did The Game release “How We Do”? (n.d.). Retrieved 8 May 2024, from https://genius.com/The-game-how-we-do-lyrics/q/release-date

  • White, M. (2017). Answer to ‘Best practice for encoding datetime in machine learning’. In Cross Validated. https://stats.stackexchange.com/a/311498

  • Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. https://doi.org/10.1109/4235.585893

  • Xiao, Z., Dellandréa, E., Dou, W., & Chen, L. (2008). What is the best segment duration for music mood analysis ? 17–24. https://doi.org/10.1109/CBMI.2008.4564922

  • Y.M.C.A. (song). (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Y.M.C.A._(song)&oldid=1222757157

  • You Don’t Know Me (Armand Van Helden song). (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=You_Don%27t_Know_Me_(Armand_Van_Helden_song)&oldid=1219319210

7 Appendix

7.1 Exploratory Data Analysis

7.1.1 Distribution and Comparison of Variables by Liked Status (Density/Histogram Plots) - All Features

7.2 Supervised Learning Results

7.2.1 Baseline Model Scores on Validation Set

Baseline Model Performance Comparison
Performance evaluated on Validation Set

Metric                 Naive   Logistic Regression   Decision Tree   SVM Linear   Random Forest
Accuracy               0.500   0.766                 0.641           0.750        0.703
AccuracyLower          0.372   0.643                 0.511           0.626        0.576
AccuracyUpper          0.628   0.862                 0.757           0.850        0.811
Kappa                  0.000   0.531                 0.281           0.500        0.406
McnemarPValue          0.000   0.606                 0.404           0.211        0.359
Sensitivity            1.000   0.812                 0.719           0.656        0.625
Specificity            0.000   0.719                 0.562           0.844        0.781
PosPredValue           0.500   0.743                 0.622           0.808        0.741
NegPredValue           NaN     0.793                 0.667           0.711        0.676
Precision              0.500   0.743                 0.622           0.808        0.741
Recall                 1.000   0.812                 0.719           0.656        0.625
F1_Score               0.667   0.776                 0.667           0.724        0.678
Prevalence             0.500   0.500                 0.500           0.500        0.500
DetectionRate          0.500   0.406                 0.359           0.328        0.312
DetectionPrevalence    1.000   0.547                 0.578           0.406        0.422
BalancedAccuracy       0.500   0.766                 0.641           0.750        0.703
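Most of the rows above can be reproduced directly from a 2×2 confusion matrix. As an illustrative sketch (not the project's R code; the helper name `classification_metrics` is ours), the counts below reconstruct the Naive column under the assumption of a balanced 64-track validation set:

```python
def classification_metrics(tp, fp, fn, tn):
    """Headline metrics as reported by caret's confusionMatrix()."""
    sensitivity = tp / (tp + fn)              # a.k.a. Recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)                # a.k.a. PosPredValue
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Precision": precision,
        "BalancedAccuracy": (sensitivity + specificity) / 2,
        "F1_Score": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Naive model: predict 'Liked' for every track in a balanced 64-track set
naive = classification_metrics(tp=32, fp=32, fn=0, tn=0)
# Sensitivity 1.000, Specificity 0.000, BalancedAccuracy 0.500, F1_Score 0.667
```

This makes explicit why the Naive model's Balanced Accuracy is 0.500 despite a perfect Recall of 1.000: it never identifies a Disliked track.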

7.2.2 Baseline vs Tuned Model Scores on Validation Set

Baseline vs Tuned Model Performance Comparison
Performance evaluated on Validation Set

Metric                 Baseline Logistic Regression   Tuned Logistic Regression   Baseline Random Forest   Tuned Random Forest
Accuracy               0.766                          0.781                       0.703                    0.766
AccuracyLower          0.643                          0.660                       0.576                    0.643
AccuracyUpper          0.862                          0.875                       0.811                    0.862
Kappa                  0.531                          0.562                       0.406                    0.531
McnemarPValue          0.606                          0.423                       0.359                    0.302
Sensitivity            0.812                          0.844                       0.625                    0.844
Specificity            0.719                          0.719                       0.781                    0.688
PosPredValue           0.743                          0.750                       0.741                    0.730
NegPredValue           0.793                          0.821                       0.676                    0.815
Precision              0.743                          0.750                       0.741                    0.730
Recall                 0.812                          0.844                       0.625                    0.844
F1_Score               0.776                          0.794                       0.678                    0.783
Prevalence             0.500                          0.500                       0.500                    0.500
DetectionRate          0.406                          0.422                       0.312                    0.422
DetectionPrevalence    0.547                          0.562                       0.422                    0.578
BalancedAccuracy       0.766                          0.781                       0.703                    0.766

7.2.3 Tuned Model Scores on Test Set

Validation vs. Test Model Performance Comparison
Performance comparison between the Validation and Test Sets

Metric                 Naive Model (Test)   Logistic Regression (Validation)   Logistic Regression (Test)   Random Forest (Validation)   Random Forest (Test)
Accuracy               0.500                0.781                              0.787                        0.766                        0.812
AccuracyLower          0.386                0.660                              0.682                        0.643                        0.710
AccuracyUpper          0.614                0.875                              0.871                        0.862                        0.891
Kappa                  0.000                0.562                              0.575                        0.531                        0.625
McnemarPValue          0.000                0.423                              0.628                        0.302                        0.302
Sensitivity            1.000                0.844                              0.825                        0.844                        0.875
Specificity            0.000                0.719                              0.750                        0.688                        0.750
PosPredValue           0.500                0.750                              0.767                        0.730                        0.778
NegPredValue           NaN                  0.821                              0.811                        0.815                        0.857
Precision              0.500                0.750                              0.767                        0.730                        0.778
Recall                 1.000                0.844                              0.825                        0.844                        0.875
F1_Score               0.667                0.794                              0.795                        0.783                        0.824
Prevalence             0.500                0.500                              0.500                        0.500                        0.500
DetectionRate          0.500                0.422                              0.412                        0.422                        0.438
DetectionPrevalence    1.000                0.562                              0.537                        0.578                        0.562
BalancedAccuracy       0.500                0.781                              0.787                        0.766                        0.812

7.2.4 Partial Dependence Plots

7.3 References